White Space Handling with the XML Object Model

Sometimes the XML Object Model shows TEXT nodes that contain only white space characters. This can be confusing, because most of the time white space is stripped. For example, the following XML:

<?xml version="1.0" ?>
<!DOCTYPE person [
  <!ELEMENT person (#PCDATA|lastname|firstname)*>
  <!ELEMENT lastname (#PCDATA)>
  <!ELEMENT firstname (#PCDATA)>
]>
<person>
  <lastname>Smith</lastname>
  <firstname>John</firstname>
</person>

generates the following tree:

Processing Instruction: xml
DocType: person
ELEMENT: person
TEXT: 
ELEMENT: lastname
TEXT: 
ELEMENT: firstname
TEXT:
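
A listing like the one above can be reproduced with a short DOM walk. The sketch below uses Python's xml.dom.minidom rather than the XML Object Model this article describes, so treat it as an illustration only: minidom is a non-validating parser that keeps every white space TEXT node, and the helper name dump_children is made up for this example. The DOCTYPE is left out because it plays no part in what the sketch prints.

```python
from xml.dom import minidom

XML = """<person>
  <lastname>Smith</lastname>
  <firstname>John</firstname>
</person>"""


def dump_children(element):
    """Return one 'KIND: value' line per direct child of element."""
    lines = []
    for child in element.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            lines.append("ELEMENT: " + child.tagName)
        elif child.nodeType == child.TEXT_NODE:
            # repr() makes white-space-only text visible in the output
            lines.append("TEXT: " + repr(child.data))
    return lines


doc = minidom.parseString(XML)
tree = dump_children(doc.documentElement)
print("\n".join(tree))
```

The printed lines alternate TEXT, ELEMENT, TEXT, ELEMENT, TEXT, matching the children of "person" in the tree above.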

The first name and last name are surrounded by TEXT nodes containing only white space because the content model for the "person" element is MIXED: its declaration contains the #PCDATA keyword. A MIXED content model indicates that text can be interspersed between the child elements. Therefore, the following is also valid:

<person>
My last name is <lastname>Smith</lastname> and my first name is
<firstname>John</firstname>
</person>

And this results in the following, similar-looking tree:

ELEMENT: person
TEXT: My last name is
ELEMENT: lastname
TEXT: and my first name is
ELEMENT: firstname
TEXT:

Without the white space after the word "is" and before <lastname>, and the white space after </lastname> and before the word "and", the sentence would be unintelligible. So, for MIXED content models, the combination of text, white space, and elements is significant. For non-MIXED content models this is not the case.
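
To see why that white space matters, the following sketch joins the text of a single-line variant of the sentence twice: once as parsed, and once with each TEXT node trimmed. It again assumes Python's xml.dom.minidom as a stand-in parser, and text_of is a hypothetical helper written for this example.

```python
from xml.dom import minidom

# Single-line variant of the mixed-content example
XML = ("<person>My last name is <lastname>Smith</lastname>"
       " and my first name is <firstname>John</firstname></person>")


def text_of(node, strip=False):
    """Concatenate all descendant text, optionally trimming each TEXT node."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data.strip() if strip else child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            parts.append(text_of(child, strip))
    return "".join(parts)


person = minidom.parseString(XML).documentElement
kept = text_of(person)                    # white space preserved
stripped = text_of(person, strip=True)    # white space trimmed per node
print(kept)      # My last name is Smith and my first name is John
print(stripped)  # My last name isSmithand my first name isJohn
```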

To make the white-space-only TEXT nodes go away, remove the #PCDATA keyword from the "person" element declaration:

<!ELEMENT person (lastname,firstname)>

which results in the following clean tree:

Processing Instruction: xml
DocType: person
ELEMENT: person
ELEMENT: lastname
ELEMENT: firstname
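
A parser that does not read the DTD for this purpose will still report the white-space-only TEXT nodes. In that situation the same clean tree can be approximated by hand. The sketch below is one possible approach, not what the XML Object Model does internally: it assumes Python's xml.dom.minidom, and both the ELEMENT_ONLY set and the strip_ignorable helper are hypothetical names that stand in for the information the DTD would supply.

```python
from xml.dom import minidom

XML = """<person>
  <lastname>Smith</lastname>
  <firstname>John</firstname>
</person>"""

# Assumption: names of elements declared with element-only content
# (no #PCDATA), as "person" is after the change described above.
ELEMENT_ONLY = {"person"}


def strip_ignorable(element):
    """Remove white-space-only TEXT children of element-only elements."""
    for child in list(element.childNodes):  # copy, since we mutate the list
        if child.nodeType == child.ELEMENT_NODE:
            strip_ignorable(child)
        elif (child.nodeType == child.TEXT_NODE
              and element.tagName in ELEMENT_ONLY
              and not child.data.strip()):
            element.removeChild(child)


doc = minidom.parseString(XML)
person = doc.documentElement
strip_ignorable(person)
names = [child.nodeName for child in person.childNodes]
print(names)
```

Text inside "lastname" and "firstname" is left alone, because those elements are declared as #PCDATA and are not listed in ELEMENT_ONLY.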