Class Parser


  • public class Parser
    extends java.lang.Object
    A lossless XML parser that preserves all formatting information including whitespace, comments, attribute quote styles, and entity encoding.

    The Parser class converts XML text into DomTrip's internal node tree representation. Unlike traditional XML parsers, which normalize content and discard formatting information, this parser preserves the original formatting so that documents can be round-tripped without loss.

    Parsing Features:

    • Whitespace Preservation - Maintains all whitespace exactly as written
    • Automatic Whitespace Normalization - Never creates Text nodes that contain only whitespace; that whitespace is stored in element properties instead
    • Attribute Formatting - Preserves quote styles, order, and spacing
    • Comment Preservation - Keeps all XML comments in their original positions
    • Entity Preservation - Maintains entity references in their original form
    • Processing Instructions - Preserves PIs including XML declarations
    • CDATA Sections - Maintains CDATA boundaries and content

    Parsing Process:

    The parser uses a stack-based approach to build the XML tree:

    1. Tokenizes the input XML character by character
    2. Identifies XML constructs (elements, comments, text, etc.)
    3. Preserves original formatting information for each construct
    4. Moves whitespace-only content into element properties instead of creating Text nodes
    5. Builds a complete node tree with parent-child relationships
    6. Maintains modification flags for selective formatting preservation
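
    The stack-based approach can be illustrated with a minimal sketch. This is not DomTrip's implementation: ToyNode and StackSketch are hypothetical names, and the toy handles only attribute-free tags and skips text, comments, and formatting details.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical toy node, not DomTrip's Element class.
final class ToyNode {
    final String name;
    final List<ToyNode> children = new ArrayList<>();

    ToyNode(String name) {
        this.name = name;
    }
}

final class StackSketch {
    // Handles only <a>, </a> and <a/> without attributes; text between tags is skipped.
    static ToyNode parse(String xml) {
        ToyNode root = new ToyNode("#document");
        Deque<ToyNode> open = new ArrayDeque<>();
        open.push(root); // the stack always holds the chain of currently open elements
        int pos = 0;
        while (pos < xml.length()) {
            int lt = xml.indexOf('<', pos);
            if (lt < 0) {
                break; // no more tags
            }
            int gt = xml.indexOf('>', lt);
            if (gt < 0) {
                throw new IllegalArgumentException("Unclosed tag at position " + lt);
            }
            String tag = xml.substring(lt + 1, gt);
            if (tag.startsWith("/")) {
                // Closing tag: it must match the element on top of the stack.
                String name = tag.substring(1).trim();
                if (open.size() == 1 || !open.peek().name.equals(name)) {
                    throw new IllegalArgumentException("Unexpected </" + name + "> at position " + lt);
                }
                open.pop();
            } else {
                boolean selfClosing = tag.endsWith("/");
                String name = (selfClosing ? tag.substring(0, tag.length() - 1) : tag).trim();
                ToyNode node = new ToyNode(name);
                open.peek().children.add(node); // attach to the current parent
                if (!selfClosing) {
                    open.push(node); // becomes the parent of whatever follows
                }
            }
            pos = gt + 1;
        }
        if (open.size() != 1) {
            throw new IllegalArgumentException("Unclosed element <" + open.peek().name + ">");
        }
        return root;
    }
}
```

    Calling StackSketch.parse("<a><b/><c/></a>") returns a synthetic root whose single child a has two children. A mismatched closing tag is reported with the offset of the offending tag.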

    Whitespace Normalization:

    The parser automatically normalizes whitespace during parsing to ensure a clean tree structure:

    • No Whitespace-Only Text Nodes - Whitespace between elements is captured in element properties
    • Mixed Content Preservation - Text nodes with actual content preserve their whitespace
    • Lossless Round-Trip - All whitespace is preserved for perfect XML reconstruction
    • Element Properties - Whitespace stored in precedingWhitespace, innerPrecedingWhitespace, etc.
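
    The idea of keeping whitespace-only runs as properties rather than as nodes can be sketched as follows. This is an illustration under simplifying assumptions, not DomTrip's code: WhitespaceSketch and Item are hypothetical, and only a single precedingWhitespace field is modeled (the real element properties listed above are richer).

```java
import java.util.ArrayList;
import java.util.List;

final class WhitespaceSketch {
    // Hypothetical item: one tag or one text run, plus the whitespace-only run before it.
    static final class Item {
        final String precedingWhitespace;
        final String content;
        final boolean isText;

        Item(String precedingWhitespace, String content, boolean isText) {
            this.precedingWhitespace = precedingWhitespace;
            this.content = content;
            this.isText = isText;
        }
    }

    final List<Item> items = new ArrayList<>();
    String trailingWhitespace = "";

    static WhitespaceSketch split(String xml) {
        WhitespaceSketch result = new WhitespaceSketch();
        String pending = "";
        int pos = 0;
        while (pos < xml.length()) {
            boolean isTag = xml.charAt(pos) == '<';
            int end;
            if (isTag) {
                int gt = xml.indexOf('>', pos);
                if (gt < 0) {
                    throw new IllegalArgumentException("Unclosed tag at position " + pos);
                }
                end = gt + 1;
            } else {
                int lt = xml.indexOf('<', pos);
                end = lt < 0 ? xml.length() : lt;
            }
            String chunk = xml.substring(pos, end);
            if (!isTag && isXmlWhitespace(chunk)) {
                pending = chunk; // whitespace-only run: remember it, create no text item
            } else {
                result.items.add(new Item(pending, chunk, !isTag));
                pending = "";
            }
            pos = end;
        }
        result.trailingWhitespace = pending;
        return result;
    }

    // Reassembles the exact input from the items and their stored whitespace.
    String join() {
        StringBuilder sb = new StringBuilder();
        for (Item item : items) {
            sb.append(item.precedingWhitespace).append(item.content);
        }
        return sb.append(trailingWhitespace).toString();
    }

    private static boolean isXmlWhitespace(String s) {
        return s.chars().allMatch(c -> c == ' ' || c == '\t' || c == '\n' || c == '\r');
    }
}
```

    In this sketch, text with actual content ("hi there ") keeps its own whitespace, while the indentation between tags never becomes a text item, and join() still reproduces the input character for character.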

    Error Handling:

    The parser provides detailed error information for malformed XML:

    • Precise error positions within the source text
    • Descriptive error messages for common XML problems
    • Context information to help locate and fix issues
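
    An error position is often easier to act on as a line and column. The helper below is a hypothetical sketch; it assumes the reported position is a 0-based character offset into the source text, which this page does not state.

```java
final class PositionSketch {
    /** Returns {line, column}, both 1-based, for a 0-based character offset. */
    static int[] lineAndColumn(String text, int offset) {
        if (offset < 0 || offset > text.length()) {
            throw new IndexOutOfBoundsException("offset " + offset);
        }
        int line = 1;
        int column = 1;
        for (int i = 0; i < offset; i++) {
            if (text.charAt(i) == '\n') {
                line++;      // a line feed starts a new line
                column = 1;
            } else {
                column++;
            }
        }
        return new int[] {line, column};
    }
}
```

    Under that assumption, the result could be combined with the exception message when reporting a DomTripException to a user.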

    Usage:

    
     Parser parser = new Parser();
     try {
         // Parse from String
         Document document = parser.parse(xmlString);
    
         // Parse from InputStream with encoding detection
         Document document2 = parser.parse(inputStream);
    
         // Parse from a second, unread InputStream with fallback encoding
         // (the stream passed to the previous call has already been read)
         Document document3 = parser.parse(anotherInputStream, "UTF-8");
    
         // Use the parsed document
     } catch (DomTripException e) {
         // Handle parsing errors
         System.err.println("Parse error at position " + e.position() + ": " + e.getMessage());
     }
     
    See Also:
    Document, Element, DomTripException, Serializer
    • Constructor Summary

      Constructors 
      Constructor Description
      Parser()
      Creates a new Parser instance with default settings.
    • Method Summary

      Modifier and Type Method Description
      Document parse(java.io.InputStream inputStream)
      Parses XML from an InputStream with automatic encoding detection.
      Document parse(java.io.InputStream inputStream, java.lang.String defaultEncoding)
      Parses XML from an InputStream with encoding detection, falling back to the named encoding.
      Document parse(java.io.InputStream inputStream, java.nio.charset.Charset defaultCharset)
      Parses XML from an InputStream with encoding detection, falling back to the given Charset.
      Document parse(java.lang.String xml)
      Parses an XML string into a lossless XML document tree.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • Parser

        public Parser()
        Creates a new Parser instance with default settings.

        No initialization is needed here because the parser state (xml, position, length) is initialized at the start of each parse(String) call.

    • Method Detail

      • parse

        public Document parse(java.io.InputStream inputStream)
                       throws DomTripException
        Parses XML from an InputStream with automatic encoding detection.

        This method automatically detects the character encoding by:

        1. Checking for a Byte Order Mark (BOM)
        2. Reading the XML declaration to extract the encoding attribute
        3. Falling back to UTF-8 if no encoding is specified

        The resulting Document will have its encoding property set to the detected or declared encoding.
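
        Step 1 can be sketched with the standard library alone. This is an illustration, not the parser's actual detection code; BomSketch is a hypothetical name, and the sketch covers only the UTF-8 and UTF-16 byte order marks (a UTF-32 little-endian BOM also begins with FF FE and is not distinguished here).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomSketch {
    /** Returns the charset implied by a leading BOM, or null when none is present. */
    static Charset detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned values; Java bytes are signed.
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

        A null result means the remaining steps (XML declaration, then the UTF-8 default) would decide the encoding.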

        Parameters:
        inputStream - the InputStream containing XML data
        Returns:
        a Document containing the parsed XML with preserved formatting
        Throws:
        DomTripException - if the XML is malformed or cannot be parsed, or if an I/O error occurs
      • parse

        public Document parse(java.io.InputStream inputStream,
                              java.lang.String defaultEncoding)
                       throws DomTripException
        Parses XML from an InputStream with encoding detection and fallback.

        This method attempts to detect the character encoding by:

        1. Checking for a Byte Order Mark (BOM)
        2. Reading the XML declaration to extract the encoding attribute
        3. Using the provided default encoding if detection fails

        The resulting Document will have its encoding property set to the detected, declared, or default encoding.
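
        Steps 2 and 3 can be sketched as reading the encoding pseudo-attribute from the XML declaration and returning a caller-supplied fallback when none is found. This is a simplified illustration rather than the parser's implementation: DeclaredEncodingSketch is a hypothetical name, and the approach works only for ASCII-compatible encodings, where the declaration bytes can be read before the real encoding is known.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DeclaredEncodingSketch {
    // Matches encoding="..." or encoding='...' inside a leading <?xml ...?> declaration.
    private static final Pattern ENCODING = Pattern.compile(
            "^<\\?xml\\s[^>]*?\\bencoding\\s*=\\s*([\"'])([A-Za-z][A-Za-z0-9._-]*)\\1");

    /** Returns the declared charset if present and supported, otherwise the fallback. */
    static Charset resolve(byte[] head, Charset fallback) {
        // ISO-8859-1 maps every byte to exactly one char, so decoding never fails.
        String prefix = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(prefix);
        if (m.find() && Charset.isSupported(m.group(2))) {
            return Charset.forName(m.group(2));
        }
        return fallback;
    }
}
```

        Passing StandardCharsets.UTF_8 as the fallback mirrors the single-argument overload's default, while any other charset plays the role of defaultEncoding.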

        Parameters:
        inputStream - the InputStream containing XML data
        defaultEncoding - the encoding name to use if detection fails
        Returns:
        a Document containing the parsed XML with preserved formatting
        Throws:
        DomTripException - if the XML is malformed or cannot be parsed, or if an I/O error occurs
      • parse

        public Document parse(java.io.InputStream inputStream,
                              java.nio.charset.Charset defaultCharset)
                       throws DomTripException
        Parses XML from an InputStream with encoding detection and fallback.

        This method attempts to detect the character encoding by:

        1. Checking for a Byte Order Mark (BOM)
        2. Reading the XML declaration to extract the encoding attribute
        3. Using the provided default charset if detection fails

        The resulting Document will have its encoding property set to the detected, declared, or default encoding.

        Parameters:
        inputStream - the InputStream containing XML data
        defaultCharset - the charset to use if detection fails
        Returns:
        a Document containing the parsed XML with preserved formatting
        Throws:
        DomTripException - if the XML is malformed or cannot be parsed, or if an I/O error occurs
      • parse

        public Document parse(java.lang.String xml)
                       throws DomTripException
        Parses an XML string into a lossless XML document tree.

        This method performs complete XML parsing while preserving all formatting information including whitespace, comments, attribute styles, and entity encoding. The resulting Document can be used for lossless round-trip editing.

        Parameters:
        xml - the XML string to parse
        Returns:
        a Document containing the parsed XML with preserved formatting
        Throws:
        DomTripException - if the XML is malformed or cannot be parsed