C# 5.0 in a Nutshell (2012)
Chapter 11. Other XML Technologies
The System.Xml namespace comprises the following namespaces and core classes:
System.Xml.*
XmlReader and XmlWriter
High-performance, forward-only cursors for reading or writing an XML stream
XmlDocument
Represents an XML document in a W3C-style DOM
System.Xml.XPath
Infrastructure and API (XPathNavigator) for XPath, a string-based language for querying XML
System.Xml.Schema
Infrastructure and API for (W3C) XSD schemas
System.Xml.Xsl
Infrastructure and API (XslCompiledTransform) for performing (W3C) XSLT transformations of XML
System.Xml.Serialization
Supports the serialization of classes to and from XML (see Chapter 17)
System.Xml.Linq
Modern, simplified, LINQ-centric version of XmlDocument (see Chapter 10)
W3C is an abbreviation for World Wide Web Consortium, where the XML standards are defined.
XmlConvert, the static class for parsing and formatting XML strings, is covered in Chapter 6.
XmlReader
XmlReader is a high-performance class for reading an XML stream in a low-level, forward-only manner.
Consider the following XML file:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<customer id="123" status="archived">
<firstname>Jim</firstname>
<lastname>Bo</lastname>
</customer>
To instantiate an XmlReader, you call the static XmlReader.Create method, passing in a Stream, a TextReader, or a URI string. For example:
using (XmlReader reader = XmlReader.Create ("customer.xml"))
...
NOTE
Because XmlReader lets you read from potentially slow sources (Streams and URIs), it offers asynchronous versions of most of its methods so that you can easily write non-blocking code. We’ll cover asynchrony in detail in Chapter 14.
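As a minimal sketch of the asynchronous API, the following counts element nodes with ReadAsync. The class and method names (AsyncReadDemo, CountElementsAsync) are hypothetical, and the XML is supplied through a StringReader purely to keep the example self-contained:

```csharp
using System;
using System.IO;
using System.Threading.Tasks;
using System.Xml;

static class AsyncReadDemo
{
    // Counts element nodes without blocking the calling thread.
    public static async Task<int> CountElementsAsync (TextReader source)
    {
        // Async must be enabled up front; otherwise the asynchronous
        // methods throw an InvalidOperationException.
        var settings = new XmlReaderSettings { Async = true };
        using (XmlReader reader = XmlReader.Create (source, settings))
        {
            int count = 0;
            while (await reader.ReadAsync())
                if (reader.NodeType == XmlNodeType.Element) count++;
            return count;
        }
    }

    static void Main()
    {
        string xml =
            "<customer><firstname>Jim</firstname><lastname>Bo</lastname></customer>";
        // Blocking on Result is acceptable in a console demo only.
        Console.WriteLine (CountElementsAsync (new StringReader (xml)).Result);  // 3
    }
}
```

In real code you would await the returned task rather than block on Result.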
To construct an XmlReader that reads from a string:
XmlReader reader = XmlReader.Create (
new System.IO.StringReader (myString));
You can also pass in an XmlReaderSettings object to control parsing and validation options. The following three properties on XmlReaderSettings are particularly useful for skipping over superfluous content:
bool IgnoreComments // Skip over comment nodes?
bool IgnoreProcessingInstructions // Skip over processing instructions?
bool IgnoreWhitespace // Skip over whitespace?
In the following example, we instruct the reader not to emit whitespace nodes, which are a distraction in typical scenarios:
XmlReaderSettings settings = new XmlReaderSettings();
settings.IgnoreWhitespace = true;
using (XmlReader reader = XmlReader.Create ("customer.xml", settings))
...
Another useful property on XmlReaderSettings is ConformanceLevel. Its default value of Document instructs the reader to assume a valid XML document with a single root node. This is a problem if you want to read just an inner portion of XML, containing multiple nodes:
<firstname>Jim</firstname>
<lastname>Bo</lastname>
To read this without throwing an exception, you must set ConformanceLevel to Fragment.
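A minimal sketch of reading such a fragment follows. The helper name ReadElementNames is hypothetical; the fragment is passed as a string so that the example needs no file:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

static class FragmentDemo
{
    // Returns the names of all elements in a fragment that may have several roots.
    public static List<string> ReadElementNames (string xml)
    {
        var settings = new XmlReaderSettings();
        settings.ConformanceLevel = ConformanceLevel.Fragment;  // Allow multiple roots
        settings.IgnoreWhitespace = true;

        var names = new List<string>();
        using (XmlReader reader = XmlReader.Create (new StringReader (xml), settings))
            while (reader.Read())
                if (reader.NodeType == XmlNodeType.Element)
                    names.Add (reader.Name);
        return names;
    }

    static void Main()
    {
        string fragment = "<firstname>Jim</firstname><lastname>Bo</lastname>";
        foreach (string name in ReadElementNames (fragment))
            Console.WriteLine (name);       // firstname, then lastname
    }
}
```

With the default ConformanceLevel of Document, the same input throws an XmlException when the reader reaches the second root element.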
XmlReaderSettings also has a property called CloseInput, which indicates whether to close the underlying stream when the reader is closed (there’s an analogous property on XmlWriterSettings called CloseOutput). The default value for CloseInput and CloseOutput is false.
Reading Nodes
The units of an XML stream are XML nodes. The reader traverses the stream in textual (depth-first) order. The Depth property of the reader returns the current depth of the cursor.
The most primitive way to read from an XmlReader is to call Read. It advances to the next node in the XML stream, rather like MoveNext in IEnumerator. The first call to Read positions the cursor at the first node. When Read returns false, it means the cursor has advanced past the last node, at which point the XmlReader should be closed and abandoned.
In this example, we read every node in the XML stream, outputting each node type as we go:
XmlReaderSettings settings = new XmlReaderSettings();
settings.IgnoreWhitespace = true;
using (XmlReader reader = XmlReader.Create ("customer.xml", settings))
  while (reader.Read())
  {
    Console.Write (new string (' ', reader.Depth * 2));  // Write indentation
    Console.WriteLine (reader.NodeType);
  }
The output is as follows:
XmlDeclaration
Element
  Element
    Text
  EndElement
  Element
    Text
  EndElement
EndElement
NOTE
Attributes are not included in Read-based traversal (see the section Reading Attributes later in this chapter).
NodeType is of type XmlNodeType, which is an enum with these members:
None XmlDeclaration Element EndElement Text Attribute
Comment Entity EndEntity EntityReference ProcessingInstruction CDATA
Document DocumentType DocumentFragment Notation Whitespace SignificantWhitespace
Two string properties on XmlReader provide access to a node’s content: Name and Value. Depending on the node type, either Name or Value (or both) is populated:
XmlReaderSettings settings = new XmlReaderSettings();
settings.IgnoreWhitespace = true;
settings.DtdProcessing = DtdProcessing.Parse;   // Must set this to read DTDs
                                                // (replaces the obsolete ProhibitDtd)
using (XmlReader r = XmlReader.Create ("customer.xml", settings))
  while (r.Read())
  {
    Console.Write (r.NodeType.ToString().PadRight (17, '-'));
    Console.Write ("> ".PadRight (r.Depth * 3));
    switch (r.NodeType)
    {
      case XmlNodeType.Element:
      case XmlNodeType.EndElement:
        Console.WriteLine (r.Name); break;
      case XmlNodeType.Text:
      case XmlNodeType.CDATA:
      case XmlNodeType.Comment:
      case XmlNodeType.XmlDeclaration:
        Console.WriteLine (r.Value); break;
      case XmlNodeType.DocumentType:
        Console.WriteLine (r.Name + " - " + r.Value); break;
      default: break;
    }
  }
To demonstrate this, we’ll expand our XML file to include a document type, entity, CDATA, and comment:
<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE customer [ <!ENTITY tc "Top Customer"> ]>
<customer id="123" status="archived">
<firstname>Jim</firstname>
<lastname>Bo</lastname>
<quote><![CDATA[C#'s operators include: < > &]]></quote>
<notes>Jim Bo is a &tc;</notes>
<!-- That wasn't so bad! -->
</customer>
An entity is like a macro; a CDATA is like a verbatim string (@"...") in C#. Here’s the result:
XmlDeclaration---> version="1.0" encoding="utf-8"
DocumentType-----> customer - <!ENTITY tc "Top Customer">
Element----------> customer
Element---------->  firstname
Text------------->     Jim
EndElement------->  firstname
Element---------->  lastname
Text------------->     Bo
EndElement------->  lastname
Element---------->  quote
CDATA------------>     C#'s operators include: < > &
EndElement------->  quote
Element---------->  notes
Text------------->     Jim Bo is a Top Customer
EndElement------->  notes
Comment---------->   That wasn't so bad!
EndElement-------> customer
XmlReader automatically resolves entities, so in our example, the entity reference &tc; expands into Top Customer.
Reading Elements
Often, you already know the structure of the XML document that you’re reading. To help with this, XmlReader provides a range of methods that read while presuming a particular structure. This simplifies your code, as well as performing some validation at the same time.
NOTE
XmlReader throws an XmlException if any validation fails. XmlException has LineNumber and LinePosition properties indicating where the error occurred—logging this information is essential if the XML file is large!
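The following sketch shows one way to capture that position information. The helper name FindFirstError and the deliberately malformed sample are our own, for illustration only:

```csharp
using System;
using System.IO;
using System.Xml;

static class XmlErrorDemo
{
    // Returns a description of the first error, or null if the XML parses cleanly.
    public static string FindFirstError (string xml)
    {
        try
        {
            using (XmlReader reader = XmlReader.Create (new StringReader (xml)))
                while (reader.Read()) { }        // Just walk every node
            return null;
        }
        catch (XmlException ex)
        {
            return "Line " + ex.LineNumber + ", position " + ex.LinePosition
                 + ": " + ex.Message;
        }
    }

    static void Main()
    {
        // The closing tag is misspelled, so the reader reports a mismatch.
        string badXml = "<customer>\n  <firstname>Jim</firstnam>\n</customer>";
        Console.WriteLine (FindFirstError (badXml));
    }
}
```

In production code you would typically write the message to a log rather than return it as a string.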
ReadStartElement verifies that the current NodeType is Element, and then calls Read. If you specify a name, it verifies that it matches that of the current element.
ReadEndElement verifies that the current NodeType is EndElement, and then calls Read.
For instance, we could read this:
<firstname>Jim</firstname>
as follows:
reader.ReadStartElement ("firstname");
Console.WriteLine (reader.Value);
reader.Read();
reader.ReadEndElement();
The ReadElementContentAsString method does all of this in one hit. It reads a start element, a text node, and an end element, returning the content as a string:
string firstName = reader.ReadElementContentAsString ("firstname", "");
The second argument refers to the namespace, which is blank in this example. There are also typed versions of this method, such as ReadElementContentAsInt, which parse the result. Returning to our original XML document:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<customer id="123" status="archived">
<firstname>Jim</firstname>
<lastname>Bo</lastname>
<creditlimit>500.00</creditlimit> <!-- OK, we sneaked this in! -->
</customer>
We could read it in as follows:
XmlReaderSettings settings = new XmlReaderSettings();
settings.IgnoreWhitespace = true;
using (XmlReader r = XmlReader.Create ("customer.xml", settings))
{
  r.MoveToContent();                // Skip over the XML declaration
  r.ReadStartElement ("customer");
  string firstName    = r.ReadElementContentAsString ("firstname", "");
  string lastName     = r.ReadElementContentAsString ("lastname", "");
  decimal creditLimit = r.ReadElementContentAsDecimal ("creditlimit", "");
  r.MoveToContent();                // Skip over that pesky comment
  r.ReadEndElement();               // Read the closing customer tag
}
NOTE
The MoveToContent method is really useful. It skips over all the fluff: XML declarations, whitespace, comments, and processing instructions. You can also instruct the reader to do most of this automatically through the properties on XmlReaderSettings.
Optional elements
In the previous example, suppose that <lastname> was optional. The solution to this is straightforward:
r.ReadStartElement ("customer");
string firstName = r.ReadElementContentAsString ("firstname", "");
string lastName = r.Name == "lastname"
                  ? r.ReadElementContentAsString() : null;
decimal creditLimit = r.ReadElementContentAsDecimal ("creditlimit", "");
Random element order
The examples in this section rely on elements appearing in the XML file in a set order. If you need to cope with elements appearing in any order, the easiest solution is to read that section of the XML into an X-DOM. We describe how to do this later in the section Patterns for Using XmlReader/XmlWriter.
Empty elements
The way that XmlReader handles empty elements presents a horrible trap. Consider the following element:
<customerList></customerList>
In XML, this is equivalent to:
<customerList/>
And yet, XmlReader treats the two differently. In the first case, the following code works as expected:
reader.ReadStartElement ("customerList");
reader.ReadEndElement();
In the second case, ReadEndElement throws an exception, because there is no separate “end element” as far as XmlReader is concerned. The workaround is to check for an empty element as follows:
bool isEmpty = reader.IsEmptyElement;
reader.ReadStartElement ("customerList");
if (!isEmpty) reader.ReadEndElement();
In reality, this is a nuisance only when the element in question may contain child elements (such as a customer list). With elements that wrap simple text (such as firstname), you can avoid the whole issue by calling a method such as ReadElementContentAsString. The ReadElementXXX methods handle both kinds of empty elements correctly.
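To see the workaround handle both forms, here is a small self-contained sketch. The wrapper name ReadCustomerList is hypothetical:

```csharp
using System;
using System.IO;
using System.Xml;

static class EmptyElementDemo
{
    // Reads a <customerList> element written either as
    // <customerList></customerList> or as <customerList/>.
    // Returns true if the self-closing form was used.
    public static bool ReadCustomerList (string xml)
    {
        using (XmlReader reader = XmlReader.Create (new StringReader (xml)))
        {
            reader.MoveToContent();
            bool isEmpty = reader.IsEmptyElement;   // Check BEFORE advancing
            reader.ReadStartElement ("customerList");
            if (!isEmpty) reader.ReadEndElement();
            return isEmpty;
        }
    }

    static void Main()
    {
        Console.WriteLine (ReadCustomerList ("<customerList></customerList>"));  // False
        Console.WriteLine (ReadCustomerList ("<customerList/>"));                // True
    }
}
```

IsEmptyElement must be captured before ReadStartElement is called, because the property describes the node at the cursor.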
Other ReadXXX methods
Table 11-1 summarizes all ReadXXX methods in XmlReader. Most of these are designed to work with elements. The sample XML fragment column shows the XML on which each method operates.
Table 11-1. Read methods
Members | Works on NodeType | Sample XML fragment | Input parameters | Data returned
ReadContentAsXXX | Text | <a>x</a> | | x
ReadString | Text | <a>x</a> | | x
ReadElementString | Element | <a>x</a> | | x
ReadElementContentAsXXX | Element | <a>x</a> | | x
ReadInnerXml | Element | <a>x</a> | | x
ReadOuterXml | Element | <a>x</a> | | <a>x</a>
ReadStartElement | Element | <a>x</a> | |
ReadEndElement | Element | <a>x</a> | |
ReadSubtree | Element | <a>x</a> | | <a>x</a>
ReadToDescendant | Element | <a>x<b></b></a> | "b" |
ReadToFollowing | Element | <a>x<b></b></a> | "b" |
ReadToNextSibling | Element | <a>x</a><b></b> | "b" |
ReadAttributeValue | Attribute | See Reading Attributes | |
The ReadContentAsXXX methods parse a text node into type XXX. Internally, the XmlConvert class performs the string-to-type conversion. The text node can be within an element or an attribute.
The ReadElementContentAsXXX methods are wrappers around corresponding ReadContentAsXXX methods. They apply to the element node, rather than the text node enclosed by the element.
NOTE
The typed ReadXXX methods also include versions that read base 64 and BinHex formatted data into a byte array.
ReadInnerXml is typically applied to an element; it reads the element and all its descendants, returning the element’s inner content as a string. When applied to an attribute, it returns the value of the attribute.
ReadOuterXml is the same as ReadInnerXml, except it includes rather than excludes the element at the cursor position.
ReadSubtree returns a proxy reader that provides a view over just the current element (and its descendants). The proxy reader must be closed before the original reader can be safely read again. At the point the proxy reader is closed, the cursor position of the original reader moves to the end of the subtree.
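As an illustrative sketch of ReadSubtree, the following counts the elements inside the first customer element only, leaving the rest of the document untouched. The method name and sample XML are assumptions made for the example:

```csharp
using System;
using System.IO;
using System.Xml;

static class SubtreeDemo
{
    // Counts element nodes in the first <customer> subtree, including
    // the <customer> element itself.
    public static int CountElementsInFirstCustomer (string xml)
    {
        using (XmlReader reader = XmlReader.Create (new StringReader (xml)))
        {
            reader.ReadToFollowing ("customer");
            int count = 0;
            // The proxy reader sees only the current element and its descendants.
            using (XmlReader subtree = reader.ReadSubtree())
                while (subtree.Read())
                    if (subtree.NodeType == XmlNodeType.Element) count++;
            return count;
        }
    }

    static void Main()
    {
        string xml =
            "<contacts>" +
              "<customer><firstname>Jim</firstname><lastname>Bo</lastname></customer>" +
              "<supplier><name>X Technologies Ltd</name></supplier>" +
            "</contacts>";
        Console.WriteLine (CountElementsInFirstCustomer (xml));   // 3
    }
}
```

The inner using statement closes the proxy reader, after which the outer reader can be read again.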
ReadToDescendant moves the cursor to the start of the first descendant node with the specified name/namespace.
ReadToFollowing moves the cursor to the start of the first node—regardless of depth—with the specified name/namespace.
ReadToNextSibling moves the cursor to the start of the first sibling node with the specified name/namespace.
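A short sketch of ReadToFollowing in a loop follows; it collects every firstname in a document regardless of depth. The method name ReadAllFirstNames and the sample data are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

static class NavigationDemo
{
    // Collects the text of every <firstname> element, at any depth.
    public static List<string> ReadAllFirstNames (string xml)
    {
        var result = new List<string>();
        using (XmlReader reader = XmlReader.Create (new StringReader (xml)))
            // ReadToFollowing always advances before testing a node, so this
            // loop assumes that two <firstname> elements are never adjacent.
            while (reader.ReadToFollowing ("firstname"))
                result.Add (reader.ReadElementContentAsString());
        return result;
    }

    static void Main()
    {
        string xml =
            "<contacts>" +
              "<customer><firstname>Jay</firstname><lastname>Dee</lastname></customer>" +
              "<customer><firstname>Kay</firstname><lastname>Gee</lastname></customer>" +
            "</contacts>";
        foreach (string name in ReadAllFirstNames (xml))
            Console.WriteLine (name);       // Jay, then Kay
    }
}
```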
ReadString and ReadElementString behave like ReadContentAsString and ReadElementContentAsString, except that they throw an exception if there’s more than a single text node within the element. In general, these methods should be avoided because they throw an exception if an element contains a comment.
Reading Attributes
XmlReader provides an indexer giving you direct (random) access to an element’s attributes—by name or position. Using the indexer is equivalent to calling GetAttribute.
Given the following XML fragment:
<customer id="123" status="archived"/>
we could read its attributes as follows:
Console.WriteLine (reader ["id"]); // 123
Console.WriteLine (reader ["status"]); // archived
Console.WriteLine (reader ["bogus"] == null); // True
WARNING
The XmlReader must be positioned on a start element in order to read attributes. After calling ReadStartElement, the attributes are gone forever!
Although attribute order is semantically irrelevant, you can access attributes by their ordinal position. We could rewrite the preceding example as follows:
Console.WriteLine (reader [0]); // 123
Console.WriteLine (reader [1]); // archived
The indexer also lets you specify the attribute’s namespace—if it has one.
AttributeCount returns the number of attributes for the current node.
Attribute nodes
To explicitly traverse attribute nodes, you must make a special diversion from the normal path of just calling Read. A good reason to do so is if you want to parse attribute values into other types, via the ReadContentAsXXX methods.
The diversion must begin from a start element. To make the job easier, the forward-only rule is relaxed during attribute traversal: you can jump to any attribute (forward or backward) by calling MoveToAttribute.
NOTE
MoveToElement returns you to the start element from anyplace within the attribute node diversion.
Returning to our previous example:
<customer id="123" status="archived"/>
we can do this:
reader.MoveToAttribute ("status");
string status = reader.ReadContentAsString();
reader.MoveToAttribute ("id");
int id = reader.ReadContentAsInt();
MoveToAttribute returns false if the specified attribute doesn’t exist.
You can also traverse each attribute in sequence by calling the MoveToFirstAttribute and then the MoveToNextAttribute methods:
if (reader.MoveToFirstAttribute())
  do
  {
    Console.WriteLine (reader.Name + "=" + reader.Value);
  }
  while (reader.MoveToNextAttribute());
// OUTPUT:
id=123
status=archived
Namespaces and Prefixes
XmlReader provides two parallel systems for referring to element and attribute names:
§ Name
§ NamespaceURI and LocalName
Whenever you read an element’s Name property or call a method that accepts a single name argument, you’re using the first system. This works well if no namespaces or prefixes are present; otherwise, it acts in a crude and literal manner. Namespaces are ignored, and prefixes are included exactly as they were written. For example:
Sample fragment | Name
<customer ...> | customer
<customer xmlns='blah' ...> | customer
<x:customer ...> | x:customer
The following code works with the first two cases:
reader.ReadStartElement ("customer");
The following is required to handle the third case:
reader.ReadStartElement ("x:customer");
The second system works through two namespace-aware properties: NamespaceURI and LocalName. These properties take into account prefixes and default namespaces defined by parent elements. Prefixes are automatically expanded. This means that NamespaceURI always reflects the semantically correct namespace for the current element, and LocalName is always free of prefixes.
When you pass two name arguments into a method such as ReadStartElement, you’re using this same system. For example, consider the following XML:
<customer xmlns="DefaultNamespace" xmlns:other="OtherNamespace">
<address>
<other:city>
...
We could read this as follows:
reader.ReadStartElement ("customer", "DefaultNamespace");
reader.ReadStartElement ("address", "DefaultNamespace");
reader.ReadStartElement ("city", "OtherNamespace");
Abstracting away prefixes is usually exactly what you want. If necessary, you can see what prefix was used through the Prefix property and convert it into a namespace by calling LookupNamespace.
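The sketch below shows both naming systems side by side for a prefixed element. The method name Describe and the city value are invented for illustration:

```csharp
using System;
using System.IO;
using System.Xml;

static class PrefixDemo
{
    // Reports Name, Prefix, LocalName, NamespaceURI, and the namespace
    // that the prefix resolves to, for the first city element found.
    public static string Describe (string xml)
    {
        using (XmlReader reader = XmlReader.Create (new StringReader (xml)))
        {
            reader.ReadToFollowing ("city", "OtherNamespace");
            return reader.Name + " | " + reader.Prefix + " | " + reader.LocalName
                 + " | " + reader.NamespaceURI
                 + " | " + reader.LookupNamespace (reader.Prefix);
        }
    }

    static void Main()
    {
        string xml =
            "<customer xmlns='DefaultNamespace' xmlns:other='OtherNamespace'>" +
              "<address><other:city>Seattle</other:city></address>" +
            "</customer>";
        Console.WriteLine (Describe (xml));
        // other:city | other | city | OtherNamespace | OtherNamespace
    }
}
```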
XmlWriter
XmlWriter is a forward-only writer of an XML stream. The design of XmlWriter is symmetrical to XmlReader.
As with XmlReader, you construct an XmlWriter by calling Create with an optional settings object. In the following example, we enable indenting to make the output more human-readable, and then write a simple XML file:
XmlWriterSettings settings = new XmlWriterSettings();
settings.Indent = true;
using (XmlWriter writer = XmlWriter.Create ("..\\..\\foo.xml", settings))
{
  writer.WriteStartElement ("customer");
  writer.WriteElementString ("firstname", "Jim");
  writer.WriteElementString ("lastname", "Bo");
  writer.WriteEndElement();
}
This produces the following document (the same as the file we read in the first example of XmlReader):
<?xml version="1.0" encoding="utf-8" ?>
<customer>
<firstname>Jim</firstname>
<lastname>Bo</lastname>
</customer>
XmlWriter automatically writes the declaration at the top unless you indicate otherwise in XmlWriterSettings, by setting OmitXmlDeclaration to true or ConformanceLevel to Fragment. The latter also permits writing multiple root nodes—something that otherwise throws an exception.
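As a minimal sketch of the Fragment setting, the following writes two root-level elements to a StringBuilder. The class and method names are hypothetical:

```csharp
using System;
using System.Text;
using System.Xml;

static class WriterFragmentDemo
{
    // Writes two root-level elements, which is legal only in Fragment mode.
    public static string WriteFragment()
    {
        var settings = new XmlWriterSettings();
        settings.ConformanceLevel = ConformanceLevel.Fragment;

        var sb = new StringBuilder();
        using (XmlWriter writer = XmlWriter.Create (sb, settings))
        {
            writer.WriteElementString ("firstname", "Jim");
            writer.WriteElementString ("lastname", "Bo");
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine (WriteFragment());
        // <firstname>Jim</firstname><lastname>Bo</lastname>
    }
}
```

No XML declaration appears in the output, and the second root element does not cause an exception.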
The WriteValue method writes a single text node. It accepts both string and nonstring types such as bool and DateTime, internally calling XmlConvert to perform XML-compliant string conversions:
writer.WriteStartElement ("birthdate");
writer.WriteValue (DateTime.Now);
writer.WriteEndElement();
In contrast, if we call:
WriteElementString ("birthdate", DateTime.Now.ToString());
the result would be both non-XML-compliant and vulnerable to incorrect parsing.
WriteString is equivalent to calling WriteValue with a string. XmlWriter automatically escapes characters that would otherwise be illegal within an attribute or element, such as &, <, >, and extended Unicode characters.
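The escaping is easy to observe by writing the same text as both an attribute value and element content. In this sketch, the note element and title attribute are made-up names:

```csharp
using System;
using System.Text;
using System.Xml;

static class EscapeDemo
{
    // Writes the given text as an attribute value and as element content.
    public static string Write (string text)
    {
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        var sb = new StringBuilder();
        using (XmlWriter writer = XmlWriter.Create (sb, settings))
        {
            writer.WriteStartElement ("note");
            writer.WriteAttributeString ("title", text);
            writer.WriteString (text);
            writer.WriteEndElement();
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine (Write ("a < b & c"));
        // <note title="a &lt; b &amp; c">a &lt; b &amp; c</note>
    }
}
```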
Writing Attributes
You can write attributes immediately after writing a start element:
writer.WriteStartElement ("customer");
writer.WriteAttributeString ("id", "1");
writer.WriteAttributeString ("status", "archived");
To write nonstring values, call WriteStartAttribute, WriteValue, and then WriteEndAttribute.
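For example, the following sketch writes an int and a bool as attribute values, letting WriteValue perform the XML-compliant conversion. The attribute names and the surrounding class are illustrative only:

```csharp
using System;
using System.Text;
using System.Xml;

static class TypedAttributeDemo
{
    // Writes a customer element whose attributes hold non-string values.
    public static string WriteCustomer (int id, bool archived)
    {
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        var sb = new StringBuilder();
        using (XmlWriter writer = XmlWriter.Create (sb, settings))
        {
            writer.WriteStartElement ("customer");

            writer.WriteStartAttribute ("id");
            writer.WriteValue (id);              // Converted via XmlConvert
            writer.WriteEndAttribute();

            writer.WriteStartAttribute ("archived");
            writer.WriteValue (archived);        // Written as lowercase true/false
            writer.WriteEndAttribute();

            writer.WriteEndElement();
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine (WriteCustomer (123, true));
    }
}
```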
Writing Other Node Types
XmlWriter also defines the following methods for writing other kinds of nodes:
WriteBase64 // for binary data
WriteBinHex // for binary data
WriteCData
WriteComment
WriteDocType
WriteEntityRef
WriteProcessingInstruction
WriteRaw
WriteWhitespace
WriteRaw directly injects a string into the output stream. There is also a WriteNode method that accepts an XmlReader, echoing everything from the given XmlReader.
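As a rough sketch, here is WriteNode copying an entire document from a reader, preceded by a comment. The method name CopyWithComment is hypothetical, and the input is assumed to carry no XML declaration of its own (a declaration cannot follow a comment):

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

static class WriteNodeDemo
{
    // Echoes everything from the reader into a new document,
    // preceded by a comment.
    public static string CopyWithComment (string xml)
    {
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        var sb = new StringBuilder();
        using (XmlReader reader = XmlReader.Create (new StringReader (xml)))
        using (XmlWriter writer = XmlWriter.Create (sb, settings))
        {
            writer.WriteComment (" copied ");
            writer.WriteNode (reader, false);   // false: skip default attributes
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine (
            CopyWithComment ("<customer><firstname>Jim</firstname></customer>"));
    }
}
```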
Namespaces and Prefixes
The overloads for the Write* methods allow you to associate an element or attribute with a namespace. Let’s rewrite the contents of the XML file in our previous example. This time we will associate all the elements with the http://oreilly.com namespace, declaring the prefix o at the customer element:
writer.WriteStartElement ("o", "customer", "http://oreilly.com");
writer.WriteElementString ("o", "firstname", "http://oreilly.com", "Jim");
writer.WriteElementString ("o", "lastname", "http://oreilly.com", "Bo");
writer.WriteEndElement();
The output is now as follows:
<?xml version="1.0" encoding="utf-8"?>
<o:customer xmlns:o="http://oreilly.com">
<o:firstname>Jim</o:firstname>
<o:lastname>Bo</o:lastname>
</o:customer>
Notice how for brevity XmlWriter omits the child element’s namespace declarations when they are already declared by the parent element.
Patterns for Using XmlReader/XmlWriter
Working with Hierarchical Data
Consider the following classes:
public class Contacts
{
  public IList<Customer> Customers = new List<Customer>();
  public IList<Supplier> Suppliers = new List<Supplier>();
}
public class Customer { public string FirstName, LastName; }
public class Supplier { public string Name; }
Suppose you want to use XmlReader and XmlWriter to serialize a Contacts object to XML as in the following:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<contacts>
<customer id="1">
<firstname>Jay</firstname>
<lastname>Dee</lastname>
</customer>
<customer> <!-- we'll assume id is optional -->
<firstname>Kay</firstname>
<lastname>Gee</lastname>
</customer>
<supplier>
<name>X Technologies Ltd</name>
</supplier>
</contacts>
The best approach is not to write one big method, but to encapsulate XML functionality in the Customer and Supplier types themselves by writing ReadXml and WriteXml methods on these types. The pattern in doing so is straightforward:
§ ReadXml and WriteXml leave the reader/writer at the same depth when they exit.
§ ReadXml reads the outer element, whereas WriteXml writes only its inner content.
Here’s how we would write the Customer type:
public class Customer
{
  public const string XmlName = "customer";
  public int? ID;
  public string FirstName, LastName;

  public Customer () { }
  public Customer (XmlReader r) { ReadXml (r); }

  public void ReadXml (XmlReader r)
  {
    if (r.MoveToAttribute ("id")) ID = r.ReadContentAsInt();
    r.ReadStartElement();
    FirstName = r.ReadElementContentAsString ("firstname", "");
    LastName = r.ReadElementContentAsString ("lastname", "");
    r.ReadEndElement();
  }

  public void WriteXml (XmlWriter w)
  {
    if (ID.HasValue) w.WriteAttributeString ("id", "", ID.ToString());
    w.WriteElementString ("firstname", FirstName);
    w.WriteElementString ("lastname", LastName);
  }
}
Notice that ReadXml reads the outer start and end element nodes. If its caller did this job instead, Customer couldn’t read its own attributes. The reason for not making WriteXml symmetrical in this regard is twofold:
§ The caller might need to choose how the outer element is named.
§ The caller might need to write extra XML attributes, such as the element’s subtype (which could then be used to decide which class to instantiate when reading back the element).
Another benefit of following this pattern is that it makes your implementation compatible with IXmlSerializable (see Chapter 17).
The Supplier class is analogous:
public class Supplier
{
  public const string XmlName = "supplier";
  public string Name;

  public Supplier () { }
  public Supplier (XmlReader r) { ReadXml (r); }

  public void ReadXml (XmlReader r)
  {
    r.ReadStartElement();
    Name = r.ReadElementContentAsString ("name", "");
    r.ReadEndElement();
  }

  public void WriteXml (XmlWriter w)
  {
    w.WriteElementString ("name", Name);
  }
}
With the Contacts class, we must enumerate the contacts element in ReadXml, checking whether each subelement is a customer or a supplier. We also have to code around the empty element trap:
public void ReadXml (XmlReader r)
{
  bool isEmpty = r.IsEmptyElement;           // This ensures we don't get
  r.ReadStartElement();                      // snookered by an empty
  if (isEmpty) return;                       // <contacts/> element!
  while (r.NodeType == XmlNodeType.Element)
  {
    if (r.Name == Customer.XmlName)      Customers.Add (new Customer (r));
    else if (r.Name == Supplier.XmlName) Suppliers.Add (new Supplier (r));
    else
      throw new XmlException ("Unexpected node: " + r.Name);
  }
  r.ReadEndElement();
}

public void WriteXml (XmlWriter w)
{
  foreach (Customer c in Customers)
  {
    w.WriteStartElement (Customer.XmlName);
    c.WriteXml (w);
    w.WriteEndElement();
  }
  foreach (Supplier s in Suppliers)
  {
    w.WriteStartElement (Supplier.XmlName);
    s.WriteXml (w);
    w.WriteEndElement();
  }
}
Mixing XmlReader/XmlWriter with an X-DOM
You can fly in an X-DOM at any point in the XML tree where XmlReader or XmlWriter becomes too cumbersome. Using the X-DOM to handle inner elements is an excellent way to combine X-DOM’s ease of use with the low-memory footprint of XmlReader and XmlWriter.
Using XmlReader with XElement
To read the current element into an X-DOM, you call XNode.ReadFrom, passing in the XmlReader. Unlike XElement.Load, this method is not “greedy” in that it doesn’t expect to see a whole document. Instead, it reads just to the end of the current subtree.
For instance, suppose we have an XML logfile structured as follows:
<log>
<logentry id="1">
<date>...</date>
<source>...</source>
...
</logentry>
...
</log>
If there were a million logentry elements, reading the whole thing into an X-DOM would waste memory. A better solution is to traverse each logentry with an XmlReader, and then use XElement to process the elements individually:
XmlReaderSettings settings = new XmlReaderSettings();
settings.IgnoreWhitespace = true;
using (XmlReader r = XmlReader.Create ("logfile.xml", settings))
{
  r.ReadStartElement ("log");
  while (r.Name == "logentry")
  {
    XElement logEntry = (XElement) XNode.ReadFrom (r);
    int id = (int) logEntry.Attribute ("id");
    DateTime date = (DateTime) logEntry.Element ("date");
    string source = (string) logEntry.Element ("source");
    ...
  }
  r.ReadEndElement();
}
If you follow the pattern described in the previous section, you can slot an XElement into a custom type’s ReadXml or WriteXml method without the caller ever knowing you’ve cheated! For instance, we could rewrite Customer’s ReadXml method as follows:
public void ReadXml (XmlReader r)
{
  XElement x = (XElement) XNode.ReadFrom (r);
  FirstName = (string) x.Element ("firstname");
  LastName = (string) x.Element ("lastname");
}
XElement collaborates with XmlReader to ensure that namespaces are kept intact and prefixes are properly expanded—even if defined at an outer level. So, if our XML file read like this:
<log xmlns="http://loggingspace">
<logentry id="1">
...
the XElements we constructed at the logentry level would correctly inherit the outer namespace.
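A small sketch can confirm this behavior. The method name FirstEntryNamespace and the abbreviated log content are assumptions made for the example:

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;

static class NamespaceInheritDemo
{
    // Reads the first logentry into an XElement and returns its namespace.
    public static string FirstEntryNamespace (string xml)
    {
        using (XmlReader r = XmlReader.Create (new StringReader (xml)))
        {
            r.ReadToFollowing ("logentry");
            XElement logEntry = (XElement) XNode.ReadFrom (r);
            // The default namespace was declared on <log>, outside the
            // subtree we loaded, yet the XElement still carries it.
            return logEntry.Name.NamespaceName;
        }
    }

    static void Main()
    {
        string xml =
            "<log xmlns='http://loggingspace'>" +
              "<logentry id='1'><source>test</source></logentry>" +
            "</log>";
        Console.WriteLine (FirstEntryNamespace (xml));   // http://loggingspace
    }
}
```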
Using XmlWriter with XElement
You can use an XElement just to write inner elements to an XmlWriter. The following code writes a million logentry elements to an XML file using XElement—without storing the whole thing in memory:
using (XmlWriter w = XmlWriter.Create ("log.xml"))
{
  w.WriteStartElement ("log");
  for (int i = 0; i < 1000000; i++)
  {
    XElement e = new XElement ("logentry",
                   new XAttribute ("id", i),
                   new XElement ("date", DateTime.Today.AddDays (-1)),
                   new XElement ("source", "test"));
    e.WriteTo (w);
  }
  w.WriteEndElement ();
}
Using an XElement incurs minimal execution overhead. If we amend this example to use XmlWriter throughout, there’s no measurable difference in execution time.
XmlDocument
XmlDocument is an in-memory representation of an XML document, which has since been superseded by the LINQ-to-XML DOM. XmlDocument’s object model and the methods that its types expose conform to a pattern defined by the W3C. So, if you’re familiar with another W3C-compliant XML DOM (e.g., in Java), you’ll be at home with XmlDocument. When compared to the more modern LINQ-to-XML DOM, however, the W3C model is much clumsier to use.
NOTE
XmlDocument is unavailable in the Metro profile. However, a similar DOM is exposed in the WinRT namespace Windows.Data.Xml.Dom.
The base type for all objects in an XmlDocument tree is XmlNode. The following types derive from XmlNode:
XmlNode
  XmlDocument
  XmlDocumentFragment
  XmlEntity
  XmlNotation
  XmlLinkedNode
XmlLinkedNode exposes NextSibling and PreviousSibling properties and is an abstract base for the following subtypes:
XmlLinkedNode
  XmlCharacterData
  XmlDeclaration
  XmlDocumentType
  XmlElement
  XmlEntityReference
  XmlProcessingInstruction
Loading and Saving an XmlDocument
To load an XmlDocument from an existing source, you instantiate an XmlDocument and then call Load or LoadXml:
§ Load accepts a filename, Stream, TextReader, or XmlReader.
§ LoadXml accepts a literal XML string.
To save a document, call Save with a filename, Stream, TextWriter, or XmlWriter:
XmlDocument doc = new XmlDocument();
doc.Load ("customer1.xml");
doc.Save ("customer2.xml");
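LoadXml and the TextWriter overload of Save can be combined to work entirely in memory. The following is a minimal sketch; the method name RoundTrip is hypothetical:

```csharp
using System;
using System.IO;
using System.Xml;

static class LoadXmlDemo
{
    // Parses a literal XML string and saves it back out to a string.
    public static string RoundTrip (string xml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml (xml);                  // Parse from a string, not a file

        StringWriter sw = new StringWriter();
        doc.Save (sw);                      // Save accepts any TextWriter
        return sw.ToString();
    }

    static void Main()
    {
        Console.WriteLine (RoundTrip (
            "<customer id='123'><firstname>Jim</firstname></customer>"));
    }
}
```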
Traversing an XmlDocument
To illustrate traversing an XmlDocument, we’ll use the following XML file:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<customer id="123" status="archived">
<firstname>Jim</firstname>
<lastname>Bo</lastname>
</customer>
The ChildNodes property (defined in XmlNode) allows you to descend into the tree structure. This returns an indexable collection:
XmlDocument doc = new XmlDocument();
doc.Load ("customer.xml");
Console.WriteLine (doc.DocumentElement.ChildNodes[0].InnerText); // Jim
Console.WriteLine (doc.DocumentElement.ChildNodes[1].InnerText); // Bo
With the ParentNode property, you can ascend back up the tree:
Console.WriteLine (
doc.DocumentElement.ChildNodes[1].ParentNode.Name); // customer
The following properties also help traverse the document (all of which return null if the node does not exist):
FirstChild |
LastChild |
NextSibling |
PreviousSibling |
The following two statements both output firstname:
Console.WriteLine (doc.DocumentElement.FirstChild.Name);
Console.WriteLine (doc.DocumentElement.LastChild.PreviousSibling.Name);
XmlNode exposes an Attributes property for accessing attributes either by name (and namespace) or by ordinal position. For example:
Console.WriteLine (doc.DocumentElement.Attributes ["id"].Value);
InnerText and InnerXml
The InnerText property represents the concatenation of all child text nodes. The following two lines both output Jim, since our XML document contains only a single text node:
Console.WriteLine (doc.DocumentElement.ChildNodes[0].InnerText);
Console.WriteLine (doc.DocumentElement.ChildNodes[0].FirstChild.Value);
Setting the InnerText property replaces all child nodes with a single text node. Be careful when setting InnerText to not accidentally wipe over element nodes. For example:
doc.DocumentElement.ChildNodes[0].InnerText = "Jo"; // wrong
doc.DocumentElement.ChildNodes[0].FirstChild.InnerText = "Jo"; // right
The InnerXml property represents the XML fragment within the current node. You typically use InnerXml on elements:
Console.WriteLine (doc.DocumentElement.InnerXml);
// OUTPUT:
<firstname>Jim</firstname><lastname>Bo</lastname>
InnerXml throws an exception if the node type cannot have children.
Creating and Manipulating Nodes
To create and add new nodes:
1. Call one of the CreateXXX methods on the XmlDocument, such as CreateElement.
2. Add the new node into the tree by calling AppendChild, PrependChild, InsertBefore, or InsertAfter on the desired parent node.
NOTE
Creating nodes requires that you first have an XmlDocument—you cannot simply instantiate an XmlElement on its own like with the X-DOM. Nodes rely on a host XmlDocument for sustenance.
For example:
XmlDocument doc = new XmlDocument();
XmlElement customer = doc.CreateElement ("customer");
doc.AppendChild (customer);
The following creates a document matching the XML we started with earlier in this chapter in the section XmlReader:
XmlDocument doc = new XmlDocument ();
doc.AppendChild (doc.CreateXmlDeclaration ("1.0", null, "yes"));
XmlAttribute id = doc.CreateAttribute ("id");
XmlAttribute status = doc.CreateAttribute ("status");
id.Value = "123";
status.Value = "archived";
XmlElement firstname = doc.CreateElement ("firstname");
XmlElement lastname = doc.CreateElement ("lastname");
firstname.AppendChild (doc.CreateTextNode ("Jim"));
lastname.AppendChild (doc.CreateTextNode ("Bo"));
XmlElement customer = doc.CreateElement ("customer");
customer.Attributes.Append (id);
customer.Attributes.Append (status);
customer.AppendChild (firstname);
customer.AppendChild (lastname);
doc.AppendChild (customer);
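You can construct the tree in any order: it doesn’t matter whether a node is attached to its parent before or after its own children and attributes are added. The one ordering that does matter is among siblings, because each AppendChild call adds the node after the children already present.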
To remove a node, you call RemoveChild, ReplaceChild, or RemoveAll.
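As a rough illustration of the three removal methods, the following sketch works on an inline copy of the customer document. The nickname element and the class name RemoveDemo are invented for the example:

```csharp
using System;
using System.Xml;

static class RemoveDemo
{
  public static XmlElement Run()
  {
    XmlDocument doc = new XmlDocument();
    doc.LoadXml ("<customer id='123'><firstname>Jim</firstname>" +
                 "<lastname>Bo</lastname></customer>");
    XmlElement customer = doc.DocumentElement;

    // ReplaceChild (newChild, oldChild): swap firstname for a new element.
    XmlElement nickname = doc.CreateElement ("nickname");   // hypothetical element
    nickname.InnerText = "Jimbo";
    customer.ReplaceChild (nickname, customer.FirstChild);
    Console.WriteLine (customer.InnerXml);
    // <nickname>Jimbo</nickname><lastname>Bo</lastname>

    // RemoveChild: detach the lastname element.
    customer.RemoveChild (customer.LastChild);
    Console.WriteLine (customer.InnerXml);        // <nickname>Jimbo</nickname>

    // RemoveAll on an element removes its children and its attributes.
    customer.RemoveAll();
    Console.WriteLine (customer.HasChildNodes);   // False
    Console.WriteLine (customer.HasAttributes);   // False
    return customer;
  }

  static void Main() { Run(); }
}
```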
Namespaces
NOTE
See Chapter 10 for an introduction to XML namespaces and prefixes.
The CreateElement and CreateAttribute methods are overloaded to let you specify a namespace and prefix:
CreateXXX (string name);
CreateXXX (string name, string namespaceURI);
CreateXXX (string prefix, string localName, string namespaceURI);
The name parameter refers to either a local name (i.e., no prefix) or a name qualified with a prefix. The namespaceURI parameter is used if and only if you are declaring (rather than merely referring to) a namespace.
Here is an example of declaring a namespace with a prefix while creating an element:
XmlElement customer = doc.CreateElement ("o", "customer",
"http://oreilly.com");
Here is an example of referring to a namespace with a prefix while creating an element:
XmlElement customer = doc.CreateElement ("o:firstname");
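To check what namespace a created element actually ends up in, you can inspect its Name, LocalName, and NamespaceURI properties. The sketch below is only an illustration: it passes the namespace URI explicitly for both elements (a deliberately cautious choice), and the class name NamespaceCreateDemo is hypothetical:

```csharp
using System;
using System.Xml;

static class NamespaceCreateDemo
{
  public static XmlDocument Build()
  {
    const string ns = "http://oreilly.com";
    XmlDocument doc = new XmlDocument();

    // prefix, local name, namespace URI
    XmlElement customer = doc.CreateElement ("o", "customer", ns);
    doc.AppendChild (customer);

    XmlElement firstname = doc.CreateElement ("o", "firstname", ns);
    firstname.InnerText = "Jim";
    customer.AppendChild (firstname);
    return doc;
  }

  static void Main()
  {
    XmlDocument doc = Build();
    XmlElement root = doc.DocumentElement;
    Console.WriteLine (root.Name);           // o:customer
    Console.WriteLine (root.LocalName);      // customer
    Console.WriteLine (root.NamespaceURI);   // http://oreilly.com
    Console.WriteLine (doc.OuterXml);        // the serialized document
  }
}
```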
In the next section, we will explain how to deal with namespaces when writing XPath queries.
XPath
XPath is the W3C standard for XML querying. In the .NET Framework, XPath can query an XmlDocument rather like LINQ queries an X-DOM. XPath has a wider scope, though, in that it’s also used by other XML technologies, such as XML schema, XSLT, and XAML. XPath is unavailable in the Metro .NET profile.
NOTE
XPath queries are expressed in terms of the XPath 2.0 Data Model. Both the DOM and the XPath Data Model represent an XML document as a tree. The difference is that the XPath Data Model is purely data-centric, abstracting away the formatting aspects of XML text. For example, CDATA sections are not required in the XPath Data Model, since the only reason CDATA sections exist is to enable text to contain markup character sequences. The XPath specification is at http://www.w3.org/tr/xpath20/.
The examples in this section all use the following XML file:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<customers>
<customer id="123" status="archived">
<firstname>Jim</firstname>
<lastname>Bo</lastname>
</customer>
<customer>
<firstname>Thomas</firstname>
<lastname>Jefferson</lastname>
</customer>
</customers>
You can write XPath queries within code in the following ways:
§ Call one of the SelectXXX methods on an XmlDocument or XmlNode.
§ Spawn an XPathNavigator from either:
§ An XmlDocument
§ An XPathDocument
§ Call an XPathXXX extension method on an XNode.
The SelectXXX methods accept an XPath query string. For example, the following finds the customer node of an XmlDocument whose firstname is Jim:
XmlDocument doc = new XmlDocument();
doc.Load ("customers.xml");
XmlNode n = doc.SelectSingleNode ("customers/customer[firstname='Jim']");
Console.WriteLine (n.InnerText); // JimBo
The SelectXXX methods delegate their implementation to XPathNavigator, which you can also use directly—over either an XmlDocument or a read-only XPathDocument.
You can also execute XPath queries over an X-DOM, via extension methods defined in System.Xml.XPath:
XDocument doc = XDocument.Load (@"Customers.xml");
XElement e = doc.XPathSelectElement ("customers/customer[firstname='Jim']");
Console.WriteLine (e.Value); // JimBo
The extension methods for use with XNodes are:
CreateNavigator
XPathEvaluate
XPathSelectElement
XPathSelectElements
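A short sketch of two of these extension methods follows. It assumes the customers XML is supplied as an inline string, and the class and method names (XNodeXPathDemo, FirstNames, CustomerCount) are made up for the example. XPathEvaluate returns an object whose runtime type depends on the expression; for a numeric expression such as count() it is a double:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

static class XNodeXPathDemo
{
  const string Xml =
    "<customers>" +
      "<customer id='123' status='archived'>" +
        "<firstname>Jim</firstname><lastname>Bo</lastname></customer>" +
      "<customer>" +
        "<firstname>Thomas</firstname><lastname>Jefferson</lastname></customer>" +
    "</customers>";

  // XPathSelectElements returns an IEnumerable<XElement>, so it composes with LINQ.
  public static List<string> FirstNames()
  {
    XDocument doc = XDocument.Parse (Xml);
    return doc.XPathSelectElements ("customers/customer/firstname")
              .Select (e => e.Value)
              .ToList();
  }

  // XPathEvaluate handles expressions that yield a scalar rather than nodes.
  public static double CustomerCount()
  {
    XDocument doc = XDocument.Parse (Xml);
    return (double) doc.XPathEvaluate ("count(//customer)");
  }

  static void Main()
  {
    Console.WriteLine (string.Join (", ", FirstNames()));   // Jim, Thomas
    Console.WriteLine (CustomerCount());                    // 2
  }
}
```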
Common XPath Operators
The XPath specification is huge. However, you can get by knowing just a few operators (see Table 11-2), just as you can play a lot of songs knowing just three chords.
Table 11-2. Common XPath operators
Operator | Description
/ | Children
// | Recursively children
. | Current node (usually implied)
.. | Parent node
* | Wildcard
@ | Attribute
[] | Filter
: | Namespace separator
To find the customers node:
XmlNode node = doc.SelectSingleNode ("customers");
The / symbol queries child nodes. To select the first customer node (SelectNodes with the same query would return both):
XmlNode node = doc.SelectSingleNode ("customers/customer");
The // operator includes all child nodes, regardless of nesting level. To select all lastname nodes:
XmlNodeList nodes = doc.SelectNodes ("//lastname");
The .. operator selects parent nodes. This example is a little silly because we’re starting from the root anyway, but it serves to illustrate the functionality:
XmlNodeList nodes = doc.SelectNodes ("customers/customer/.."); // the customers element
The * operator selects nodes regardless of name. The following selects the child nodes of customer, regardless of name:
XmlNodeList nodes = doc.SelectNodes ("customers/customer/*");
The @ operator selects attributes. * can be used as a wildcard. Here is how to select the id attribute:
XmlNode node = doc.SelectSingleNode ("customers/customer/@id");
The [] operator filters a selection, in conjunction with the operators =, !=, <, >, not(), and, and or. In this example, we filter on firstname:
XmlNode n = doc.SelectSingleNode ("customers/customer[firstname='Jim']");
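The : operator qualifies a namespace. Had the customers element been qualified with the x namespace, we would access it as follows (in practice the x prefix must also be registered with an XmlNamespaceManager, as described in the section Querying with Namespaces):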
XmlNode node = doc.SelectSingleNode ("x:customers");
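The operators can be tried side by side against the sample document. The following sketch assumes the customers XML is loaded from an inline string. The class name XPathOperatorsDemo is hypothetical, and the not(@id) query is an extra illustration of the filter operator:

```csharp
using System;
using System.Xml;

static class XPathOperatorsDemo
{
  public static XmlDocument Load()
  {
    XmlDocument doc = new XmlDocument();
    doc.LoadXml (
      "<customers>" +
        "<customer id='123' status='archived'>" +
          "<firstname>Jim</firstname><lastname>Bo</lastname></customer>" +
        "<customer>" +
          "<firstname>Thomas</firstname><lastname>Jefferson</lastname></customer>" +
      "</customers>");
    return doc;
  }

  static void Main()
  {
    XmlDocument doc = Load();

    Console.WriteLine (doc.SelectNodes ("customers/customer").Count);      // 2
    Console.WriteLine (doc.SelectNodes ("//lastname").Count);              // 2
    Console.WriteLine (doc.SelectNodes ("customers/customer/*").Count);    // 4
    Console.WriteLine (doc.SelectNodes ("customers/customer/@*").Count);   // 2

    // .. steps back up to the parent of customer.
    Console.WriteLine (doc.SelectSingleNode ("customers/customer/..").Name);    // customers

    Console.WriteLine (doc.SelectSingleNode ("customers/customer/@id").Value);  // 123

    // A filter can be followed by further steps.
    Console.WriteLine (doc.SelectSingleNode (
      "customers/customer[firstname='Jim']/lastname").InnerText);               // Bo

    // not() inverts a filter: customers that have no id attribute.
    Console.WriteLine (doc.SelectNodes ("customers/customer[not(@id)]").Count); // 1
  }
}
```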
XPathNavigator
XPathNavigator is a cursor over the XPath Data Model representation of an XML document. It is loaded with primitive methods that move the cursor around the tree (e.g., move to parent, move to first child, etc.). The XPathNavigator’s Select* methods take an XPath string to express more complex navigations or queries that return multiple nodes.
Spawn instances of XPathNavigator from an XmlDocument, an XPathDocument, or another XPathNavigator. Here is an example of spawning an XPathNavigator from an XmlDocument:
XPathNavigator nav = doc.CreateNavigator();
XPathNavigator jim = nav.SelectSingleNode
(
"customers/customer[firstname='Jim']"
);
Console.WriteLine (jim.Value); // JimBo
In the XPath Data Model, the value of an element node is the concatenation of its descendant text nodes, equivalent to XmlDocument’s InnerText property.
The SelectSingleNode method returns a single XPathNavigator. The Select method returns an XPathNodeIterator, which simply iterates over multiple XPathNavigators. For example:
XPathNavigator nav = doc.CreateNavigator();
string xPath = "customers/customer/firstname/text()";
foreach (XPathNavigator navC in nav.Select (xPath))
Console.WriteLine (navC.Value);
OUTPUT:
Jim
Thomas
To perform faster queries, you can compile an XPath query into an XPathExpression. You then pass the compiled expression to a Select* method, instead of a string. For example:
XPathNavigator nav = doc.CreateNavigator();
XPathExpression expr = nav.Compile ("customers/customer/firstname");
foreach (XPathNavigator a in nav.Select (expr))
Console.WriteLine (a.Value);
OUTPUT:
Jim
Thomas
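Besides the Select* methods, the navigator can be moved one step at a time with its primitive cursor methods. The sketch below is an illustration only: it uses a trimmed, inline version of the sample document, and the class name NavigatorCursorDemo is made up. Each MoveTo* method returns a bool indicating whether the move succeeded, which the example ignores for brevity:

```csharp
using System;
using System.Xml;
using System.Xml.XPath;

static class NavigatorCursorDemo
{
  // Walks the cursor through the tree and records what it sees at each stop.
  public static string[] Run()
  {
    XmlDocument doc = new XmlDocument();
    doc.LoadXml ("<customers><customer id='123'>" +
                 "<firstname>Jim</firstname><lastname>Bo</lastname>" +
                 "</customer></customers>");

    XPathNavigator nav = doc.CreateNavigator();   // starts on the root node
    string start = nav.NodeType.ToString();       // Root

    nav.MoveToFirstChild();                       // customers
    nav.MoveToFirstChild();                       // customer
    string id = nav.GetAttribute ("id", "");      // 123

    nav.MoveToFirstChild();                       // firstname
    string first = nav.Value;                     // Jim

    nav.MoveToNext();                             // next sibling: lastname
    string last = nav.Name + "=" + nav.Value;     // lastname=Bo

    nav.MoveToParent();                           // back up to customer
    string parent = nav.Name;                     // customer

    return new[] { start, id, first, last, parent };
  }

  static void Main()
  {
    foreach (string s in Run()) Console.WriteLine (s);
  }
}
```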
Querying with Namespaces
Querying elements and attributes that contain namespaces requires some extra unintuitive steps. Consider the following XML file:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<o:customers xmlns:o='http://oreilly.com'>
<o:customer id="123" status="archived">
<firstname>Jim</firstname>
<lastname>Bo</lastname>
</o:customer>
<o:customer>
<firstname>Thomas</firstname>
<lastname>Jefferson</lastname>
</o:customer>
</o:customers>
The following query will fail, despite qualifying the nodes with the prefix o:
XmlDocument doc = new XmlDocument();
doc.Load ("customers.xml");
XmlNode n = doc.SelectSingleNode ("o:customers/o:customer"); // throws XPathException
Console.WriteLine (n.InnerText); // never reached
To make this query work, you must first create an XmlNamespaceManager instance as follows:
XmlNamespaceManager xnm = new XmlNamespaceManager (doc.NameTable);
You can treat NameTable as a black box (XmlNamespaceManager uses it internally to cache and reuse strings). Once we create the namespace manager, we can add prefix/namespace pairs to it as follows:
xnm.AddNamespace ("o", "http://oreilly.com");
The Select* methods on XmlDocument and XPathNavigator have overloads that accept an XmlNamespaceManager. We can successfully rewrite the previous query as follows:
XmlNode n = doc.SelectSingleNode ("o:customers/o:customer", xnm);
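Putting the pieces together, here is a self-contained sketch that loads a trimmed version of the namespaced document from an inline string. The class and method names are hypothetical. It deliberately registers the prefix c instead of o, to show that the prefix used in the query only has to map to the same namespace URI, not match the prefix used in the document:

```csharp
using System;
using System.Xml;
using System.Xml.XPath;

static class NamespaceQueryDemo
{
  public static XmlDocument Load()
  {
    XmlDocument doc = new XmlDocument();
    doc.LoadXml (
      "<o:customers xmlns:o='http://oreilly.com'>" +
        "<o:customer id='123' status='archived'>" +
          "<firstname>Jim</firstname><lastname>Bo</lastname>" +
        "</o:customer>" +
      "</o:customers>");
    return doc;
  }

  // A prefixed query without a namespace manager is rejected.
  public static bool FailsWithoutManager (XmlDocument doc)
  {
    try { doc.SelectSingleNode ("o:customers/o:customer"); return false; }
    catch (XPathException) { return true; }
  }

  public static string QueryWithManager (XmlDocument doc)
  {
    XmlNamespaceManager xnm = new XmlNamespaceManager (doc.NameTable);
    xnm.AddNamespace ("c", "http://oreilly.com");   // query-side prefix
    XmlNode n = doc.SelectSingleNode ("c:customers/c:customer", xnm);
    return n.InnerText;
  }

  static void Main()
  {
    XmlDocument doc = Load();
    Console.WriteLine (FailsWithoutManager (doc));   // True
    Console.WriteLine (QueryWithManager (doc));      // JimBo
  }
}
```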
XPathDocument
XPathDocument is used for read-only XML documents that conform to the W3C XPath Data Model. An XPathNavigator backed by an XPathDocument is faster than one backed by an XmlDocument, but it cannot make changes to the underlying document:
XPathDocument doc = new XPathDocument ("customers.xml");
XPathNavigator nav = doc.CreateNavigator();
foreach (XPathNavigator a in nav.Select ("customers/customer/firstname"))
Console.WriteLine (a.Value);
OUTPUT:
Jim
Thomas
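The read-only nature of an XPathDocument-backed navigator is visible through the navigator's CanEdit property. The following sketch compares the two kinds of navigator over the same inline XML. The class name ReadOnlyNavigatorDemo is invented, and the XPathDocument is built from a StringReader rather than a file:

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.XPath;

static class ReadOnlyNavigatorDemo
{
  const string Xml =
    "<customers><customer><firstname>Jim</firstname></customer></customers>";

  // Navigator over a read-only XPathDocument.
  public static XPathNavigator FromXPathDocument()
  {
    XPathDocument doc = new XPathDocument (new StringReader (Xml));
    return doc.CreateNavigator();
  }

  // Navigator over an editable XmlDocument.
  public static XPathNavigator FromXmlDocument()
  {
    XmlDocument doc = new XmlDocument();
    doc.LoadXml (Xml);
    return doc.CreateNavigator();
  }

  static void Main()
  {
    XPathNavigator readOnly = FromXPathDocument();
    XPathNavigator editable = FromXmlDocument();

    Console.WriteLine (readOnly.CanEdit);   // False
    Console.WriteLine (editable.CanEdit);   // True

    // Queries behave the same way over either store.
    Console.WriteLine (
      readOnly.SelectSingleNode ("customers/customer/firstname").Value);   // Jim
  }
}
```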
XSD and Schema Validation
The content of a particular XML document is nearly always domain-specific, such as a Microsoft Word document, an application configuration document, or a web service. For each domain, the XML file conforms to a particular pattern. There are several standards for describing the schema of such a pattern, to standardize and automate the interpretation and validation of XML documents. The most widely accepted standard is XSD, short for XML Schema Definition. Its precursors, DTD and XDR, are also supported by System.Xml.
Consider the following XML document:
<?xml version="1.0"?>
<customers>
<customer id="1" status="active">
<firstname>Jim</firstname>
<lastname>Bo</lastname>
</customer>
<customer id="1" status="archived">
<firstname>Thomas</firstname>
<lastname>Jefferson</lastname>
</customer>
</customers>
We can write an XSD for this document as follows:
<?xml version="1.0" encoding="utf-8"?>
<xs:schema attributeFormDefault="unqualified"
elementFormDefault="qualified"
xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="customers">
<xs:complexType>
<xs:sequence>
<xs:element maxOccurs="unbounded" name="customer">
<xs:complexType>
<xs:sequence>
<xs:element name="firstname" type="xs:string" />
<xs:element name="lastname" type="xs:string" />
</xs:sequence>
<xs:attribute name="id" type="xs:int" use="required" />
<xs:attribute name="status" type="xs:string" use="required" />
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
As you can see, XSD documents are themselves written in XML. Furthermore, an XSD document is describable with XSD—you can find that definition at http://www.w3.org/2001/xmlschema.xsd.
Performing Schema Validation
You can validate an XML file or document against one or more schemas before reading or processing it. There are a number of reasons to do so:
§ You can get away with less error checking and exception handling.
§ Schema validation picks up errors you might otherwise overlook.
§ Error messages are detailed and informative.
To perform validation, plug a schema into an XmlReader, an XmlDocument, or an X-DOM object, and then read or load the XML as you would normally. Schema validation happens automatically as content is read, so the input stream is not read twice.
Validating with an XmlReader
Here’s how to plug a schema from the file customers.xsd into an XmlReader:
XmlReaderSettings settings = new XmlReaderSettings();
settings.ValidationType = ValidationType.Schema;
settings.Schemas.Add (null, "customers.xsd");
using (XmlReader r = XmlReader.Create ("customers.xml", settings))
...
If the schema is inline, set the following flag instead of adding to Schemas:
settings.ValidationFlags |= XmlSchemaValidationFlags.ProcessInlineSchema;
You then Read as you would normally. If schema validation fails at any point, an XmlSchemaValidationException is thrown.
NOTE
Calling Read on its own validates both elements and attributes: you don’t need to navigate to each individual attribute for it to be validated.
If you want only to validate the document, you can do this:
using (XmlReader r = XmlReader.Create ("customers.xml", settings))
try { while (r.Read()) ; }
catch (XmlSchemaValidationException ex)
{
...
}
XmlSchemaValidationException has properties for the error Message, LineNumber, and LinePosition. In this case, it only tells you about the first error in the document. If you want to report on all errors in the document, you instead must handle the ValidationEventHandler event:
XmlReaderSettings settings = new XmlReaderSettings();
settings.ValidationType = ValidationType.Schema;
settings.Schemas.Add (null, "customers.xsd");
settings.ValidationEventHandler += ValidationHandler;
using (XmlReader r = XmlReader.Create ("customers.xml", settings))
while (r.Read()) ;
When you handle this event, schema errors no longer throw exceptions. Instead, they fire your event handler:
static void ValidationHandler (object sender, ValidationEventArgs e)
{
Console.WriteLine ("Error: " + e.Exception.Message);
}
The Exception property of ValidationEventArgs contains the XmlSchemaValidationException that would have otherwise been thrown.
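The event-handler approach can be exercised without any files by reading both the schema and the document from strings. This is a sketch under assumptions: the schema is a cut-down, single-customer variant of customers.xsd, and the names ValidateAllDemo and Validate are hypothetical. The handler collects messages into a list instead of writing them to the console:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

static class ValidateAllDemo
{
  // Cut-down schema: one customer with two child elements and a required int id.
  const string Xsd =
    "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>" +
      "<xs:element name='customer'><xs:complexType>" +
        "<xs:sequence>" +
          "<xs:element name='firstname' type='xs:string' />" +
          "<xs:element name='lastname' type='xs:string' />" +
        "</xs:sequence>" +
        "<xs:attribute name='id' type='xs:int' use='required' />" +
      "</xs:complexType></xs:element>" +
    "</xs:schema>";

  // Returns every validation message raised while reading the document.
  public static List<string> Validate (string xml)
  {
    List<string> errors = new List<string>();

    XmlReaderSettings settings = new XmlReaderSettings();
    settings.ValidationType = ValidationType.Schema;
    settings.Schemas.Add (null, XmlReader.Create (new StringReader (Xsd)));
    settings.ValidationEventHandler += (sender, e) => errors.Add (e.Message);

    using (XmlReader r = XmlReader.Create (new StringReader (xml), settings))
      while (r.Read()) ;   // reading drives validation

    return errors;
  }

  static void Main()
  {
    // The id is not an int and lastname is missing.
    string bad = "<customer id='abc'><firstname>Jim</firstname></customer>";
    foreach (string message in Validate (bad))
      Console.WriteLine ("Error: " + message);
  }
}
```

Because the handler is attached, reading continues past the first problem, so the list can end up holding more than one message for a single document.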
NOTE
The System.Xml namespace also contains a class called XmlValidatingReader. This was used to perform schema validation prior to Framework 2.0, and it is now deprecated.
Validating an X-DOM or XmlDocument
To validate an XML file or stream while reading into an X-DOM or XmlDocument, you create an XmlReader, plug in the schemas, and then use the reader to load the DOM:
XmlReaderSettings settings = new XmlReaderSettings();
settings.ValidationType = ValidationType.Schema;
settings.Schemas.Add (null, "customers.xsd");
XDocument doc;
using (XmlReader r = XmlReader.Create ("customers.xml", settings))
try { doc = XDocument.Load (r); }
catch (XmlSchemaValidationException ex) { ... }
XmlDocument xmlDoc = new XmlDocument();
using (XmlReader r = XmlReader.Create ("customers.xml", settings))
try { xmlDoc.Load (r); }
catch (XmlSchemaValidationException ex) { ... }
You can also validate an XDocument or XElement that’s already in memory, by calling extension methods in System.Xml.Schema. These methods accept an XmlSchemaSet (a collection of schemas) and a validation event handler:
XDocument doc = XDocument.Load (@"customers.xml");
XmlSchemaSet set = new XmlSchemaSet ();
set.Add (null, @"customers.xsd");
StringBuilder errors = new StringBuilder ();
doc.Validate (set, (sender, args) => { errors.AppendLine
(args.Exception.Message); }
);
Console.WriteLine (errors.ToString());
To validate an XmlDocument already in memory, add the schema(s) to the XmlDocument’s Schemas collection and then call the document’s Validate method, passing in a ValidationEventHandler to process the errors.
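For the XmlDocument case, a minimal sketch might look like the following. It assumes a cut-down single-customer schema supplied as a string, and the class and method names are hypothetical. The document is validated, edited in memory, and then validated again:

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Schema;

static class InMemoryValidateDemo
{
  const string Xsd =
    "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>" +
      "<xs:element name='customer'><xs:complexType>" +
        "<xs:sequence>" +
          "<xs:element name='firstname' type='xs:string' />" +
          "<xs:element name='lastname' type='xs:string' />" +
        "</xs:sequence>" +
      "</xs:complexType></xs:element>" +
    "</xs:schema>";

  // Counts the validation events raised for the document's current state.
  static int CountErrors (XmlDocument doc)
  {
    int count = 0;
    doc.Validate ((sender, e) => count++);
    return count;
  }

  // Returns the error counts before and after an in-memory edit.
  public static int[] Run()
  {
    XmlDocument doc = new XmlDocument();
    doc.LoadXml ("<customer><firstname>Jim</firstname>" +
                 "<lastname>Bo</lastname></customer>");

    // Add the schema once to the document's Schemas collection.
    doc.Schemas.Add (null, XmlReader.Create (new StringReader (Xsd)));

    int before = CountErrors (doc);   // document conforms

    // Remove lastname, which the schema requires, then validate again.
    doc.DocumentElement.RemoveChild (doc.DocumentElement.LastChild);
    int after = CountErrors (doc);

    return new[] { before, after };
  }

  static void Main()
  {
    int[] counts = Run();
    Console.WriteLine ("Errors before edit: " + counts[0]);
    Console.WriteLine ("Errors after edit:  " + counts[1]);
  }
}
```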
XSLT
XSLT stands for Extensible Stylesheet Language Transformations. It is an XML language that describes how to transform one XML language into another. The quintessential example of such a transformation is transforming an XML document (that typically describes data) into an XHTML document (that describes a formatted document).
Consider the following XML file:
<customer>
<firstname>Jim</firstname>
<lastname>Bo</lastname>
</customer>
The following XSLT file describes such a transformation:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:template match="/">
<html>
<p><xsl:value-of select="//firstname"/></p>
<p><xsl:value-of select="//lastname"/></p>
</html>
</xsl:template>
</xsl:stylesheet>
The output is as follows:
<html>
<p>Jim</p>
<p>Bo</p>
</html>
The System.Xml.Xsl.XslCompiledTransform class efficiently performs XSLT transforms. It renders XslTransform obsolete. XslCompiledTransform works very simply:
XslCompiledTransform transform = new XslCompiledTransform();
transform.Load ("test.xslt");
transform.Transform ("input.xml", "output.xml");
Generally, it’s more useful to use the overload of Transform that accepts an XmlWriter rather than an output file, so you can control the formatting.
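As a sketch of the XmlWriter overload, the example below loads the stylesheet and input from inline strings and writes the result to a StringWriter through an XmlWriter configured for indentation. The class name TransformDemo and the choice of XmlWriterSettings values are assumptions for illustration:

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

static class TransformDemo
{
  const string Xslt =
    "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>" +
      "<xsl:template match='/'>" +
        "<html>" +
          "<p><xsl:value-of select='//firstname'/></p>" +
          "<p><xsl:value-of select='//lastname'/></p>" +
        "</html>" +
      "</xsl:template>" +
    "</xsl:stylesheet>";

  const string Input =
    "<customer><firstname>Jim</firstname><lastname>Bo</lastname></customer>";

  public static string Run()
  {
    XslCompiledTransform transform = new XslCompiledTransform();
    using (XmlReader stylesheet = XmlReader.Create (new StringReader (Xslt)))
      transform.Load (stylesheet);

    // The writer's settings, not the output file, decide the formatting.
    XmlWriterSettings settings = new XmlWriterSettings();
    settings.Indent = true;
    settings.OmitXmlDeclaration = true;

    StringWriter output = new StringWriter();
    using (XmlReader input = XmlReader.Create (new StringReader (Input)))
    using (XmlWriter writer = XmlWriter.Create (output, settings))
      transform.Transform (input, writer);

    return output.ToString();
  }

  static void Main()
  {
    Console.WriteLine (Run());
  }
}
```

Writing to an XmlWriter also means the result can go to any destination a writer supports, such as a stream or a string, rather than only to a file on disk.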