Package Torello.HTML

Class Surrounding


  • public class Surrounding
    extends java.lang.Object
    Class for finding ancestor & parent nodes of any selected HTMLNode.

    Substitute for the DOM-Tree concepts of 'parent' and 'ancestor'


    Class 'Surrounding' is intended to function in place of the Java-Script DOM Tree (Document Object Model Tree) concepts known as "Parent" and "Ancestor". Thinking about documents as trees may make some parts of Java-Script a little easier to program. For any content-based page, however, it is much more consistent with the intentions of the author or writer of the page to think of HTML as a list of nodes (TextNode's, TagNode's, etc.). The name "HTML" stands for Hyper-Text Markup Language, meaning that text documents are simply "marked up" by HTML Elements. Using Java Vector's instead of DOM Trees is therefore the guiding philosophy.

    There are other HTML Parsers which build DOM Trees, and those parsers are generally quick to modify the HTML being parsed if any unmatched closing tags are found. Here, instead, the philosophy is that the HTML is presumed valid; if unmatched closing HTML tags are present, an 'Inclusive' search will simply not produce the expected result. Since the vast majority of uses for this package would be scraping news & informational sites, all of which have automatically generated HTML, worrying about unclosed HTML is best left to the "Browser Pioneers" who write the rendering functions for web-pages, and the concept is mostly ignored here.
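The "presumed valid" policy can be pictured with a small, self-contained sketch. This is not the library's code: tokens are modeled as plain strings without attributes, and `matchingClose` is a hypothetical helper invented for illustration. It counts same-named openers to locate the matching closer, and when the closer is missing it reports failure instead of repairing the markup.

```java
import java.util.List;

class InclusiveSketch {
    // Hypothetical helper, not part of Torello.HTML: returns the index of the
    // closing tag that matches the opening tag at openIdx, or -1 if none exists.
    static int matchingClose(List<String> tokens, int openIdx) {
        String open = tokens.get(openIdx);           // e.g. "<li>"
        String close = "</" + open.substring(1);     // e.g. "</li>"
        int depth = 0;
        for (int i = openIdx; i < tokens.size(); i++) {
            String t = tokens.get(i);
            if (t.equals(open)) depth++;             // same-named opener: one level deeper
            else if (t.equals(close) && --depth == 0) return i;
        }
        return -1; // unmatched: nothing is repaired, the search simply fails
    }

    public static void main(String[] args) {
        List<String> valid  = List.of("<ol>", "<li>", "text", "</li>", "</ol>");
        List<String> broken = List.of("<ol>", "<li>", "text", "</ol>");
        System.out.println(matchingClose(valid, 1));   // closer found
        System.out.println(matchingClose(broken, 1));  // no closer for <li>
    }
}
```

In the second list the `<li>` is never closed, so the lookup fails rather than guessing where the element should end.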



    The following example demonstrates how to find the parent and ancestor nodes of the node at a particular index. The example parses one of the JavaDoc documentation pages for this package. It then picks a particular TextNode instance, and asks for all of the HTML Elements whose opening and closing tags "enclose" that TextNode.

    Example:
    // Load the documentation html page into vectorized-html
    StringBuffer     sb      = new StringBuffer();
    URL              url     = new URL("http://developer.torello.directory/JavaHTML/Version%201/1.4/javadoc/Torello/HTML/NodeSearch/CommentNodeCount.html");
    Vector<HTMLNode> page    = HTMLPage.getPageTokens(url, false);
    
    // Obtain a vector-index pointer to the text-node containing the indicated string:
    // "a count of how many"
    // This is a line of text from the JavaDoc HTML Page that was loaded above.
    int pos = TextNodeFind.first(page, TextComparitor.CN_CI, "a count of how many");
    
    // Print the output found above to a StringBuffer
    sb.append("Text Node Found: [" + page.elementAt(pos) + "]\n");
    
    // Find the first "ancestor node" or "parent node" of this TextNode
    // Restrict the search to leave out: <LI>, <BODY> or <DIV>
    DotPair dp = Surrounding.firstExcept(page, pos, "li", "body", "div");
    
    // Print the output of this search to the StringBuffer / Log
    sb.append("Index Found: " + pos + ", DotPair Found: " + dp.toString() + "\n");
    sb.append(Debug.printJ(page, dp) + "\n");
    
    // Now print all "ancestor nodes" (Surrounding nodes) - leave out <BODY>, <HTML> and <DIV>
    // ancestors
    Vector<DotPair> allDP = Surrounding.allExcept(page, pos, "body", "html", "div");
    
    for (DotPair l : allDP)
    
        sb.append(
            C.BCYAN + 
            "************************************************************\n" +
            "************************************************************\n" + C.RESET +
            "Index Found: " + pos + ", DotPair Found: " + l.toString() + "\n" +
            "Starting Node: " + C.BRED + page.elementAt(l.start).str + C.RESET + "\n" +
            "Ending Node:" + C.BRED + page.elementAt(l.end).str + C.RESET + "\n"
        );
    
    // Print the StringBuffer / Log to Standard-Out, and to the text-file "out.html"
    String s = sb.toString();
    System.out.println(s);
    
    // NOTE: The above "Printing" uses the Shell.C class (which holds UNIX Color-Codes)
    //       This escapes the angle-brackets, and converts those color-codes to HTML
    //       <SPAN>...</SPAN> Elements
    FileRW.writeFile(C.toHTML(s.replace("<", "&lt;").replace(">", "&gt;")), "out.html");
    


    The above example would print these results to a UNIX terminal:


    Text Node Found: [ returns a count of how many TextNode's were identified on the vectorized-page parameter
    ]
    Index Found: 698, DotPair Found: [690, 705]
    [<ol>][
    ][<li>][<b>][<code>][int][</code>][</b>][ returns a count of how many TextNode's were identified on the vectorized-page parameter
    ][<code>]['html'][</code>][ that contained text that matched the specified criteria][</li>][
    ][</ol>]
    ************************************************************
    ************************************************************
    Index Found: 698, DotPair Found: [692, 703]
    Starting Node: <li>
    Ending Node:</li>
    ************************************************************
    ************************************************************
    Index Found: 698, DotPair Found: [690, 705]
    Starting Node: <ol>
    Ending Node:</ol>
    ************************************************************
    ************************************************************
    Index Found: 698, DotPair Found: [284, 817]
    Starting Node: <ul class="blockList">
    Ending Node:</ul>




    Stateless Class:
    This class neither contains any program-state, nor can it be instantiated. The @StaticFunctional Annotation may also be called 'The Spaghetti Report'. Static-Functional classes are, essentially, C-Styled Files, without any constructors or non-static member fields. It is a concept very similar to the Java-Bean's @Stateless Annotation.

    • 1 Constructor(s), 1 declared private, zero-argument constructor
    • 7 Method(s), 7 declared static
    • 0 Field(s)


    • Method Summary

       
      First Ancestor
      Modifier and Type Method
      static DotPair first​(Vector<? extends HTMLNode> html, int index, String... htmlTags)
      static DotPair firstExcept​(Vector<? extends HTMLNode> html, int index, String... htmlTags)
       
      All Ancestors
      Modifier and Type Method
      static Vector<DotPair> all​(Vector<? extends HTMLNode> html, int index, String... htmlTags)
      static Vector<DotPair> allExcept​(Vector<? extends HTMLNode> html, int index, String... htmlTags)
       
      Protected, Internal Methods
      Modifier and Type Method
      protected static Vector<DotPair> ALL​(Vector<? extends HTMLNode> html, int index, Torello.HTML.HTMLTagCounter tagCounter)
      protected static DotPair FIRST​(Vector<? extends HTMLNode> html, int index, Torello.HTML.HTMLTagCounter tagCounter)
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Method Detail

      • first

        public static DotPair first​(java.util.Vector<? extends HTMLNode> html,
                                    int index,
                                    java.lang.String... htmlTags)
        This will return the first ancestor node, along with its closing element, as a DotPair, that matches the input-parameter 'htmlTags'.
        Parameters:
        html - This may be any Vectorized-HTML Web-Page (or sub-page).

        The Variable-Type Wild-Card Expression '? extends HTMLNode' means that a Vector<TagNode>, Vector<TextNode> or Vector<CommentNode> will all be accepted by this parameter without causing a compile-time type error.

        These 'sub-type' Vectors are often returned as search results from the classes in the 'NodeSearch' package.
        index - This is the index of the node for whose "ancestors" we are searching (to use a Java-Script DOM Tree term).
        htmlTags - If this list is empty, any ancestor node is eligible. Since this method returns the first ancestor found, an empty list means that if the index-node is surrounded by even a bold "<B>...</B>" pair, that pair will be the DotPair result that is returned. If this list is non-empty, then only an ancestor node whose HTML Element Tag (usually referred to as "the Element") matches a tag from this list shall be returned.

        FOR INSTANCE: If "div", "p", and "a" were provided as values to this parameter - the search loop would skip over all ancestors that were not HTML divider, paragraph or anchor elements before selecting a result.
        Returns:
        This shall return the first sub-list, as a 'DotPair' (start & end index pair). If no matches are found, null is returned. This sublist is nearly identical to the Java-Script DOM Tree concept of ancestor-node, though no trees are constructed by this method.
        Throws:
        java.lang.ArrayIndexOutOfBoundsException - If index is not within the bounds of the passed vectorized-html parameter 'html'
        HTMLTokException - If any of the tags passed are null, or not found in the table of class HTMLTags - specifically if they are not valid HTML Elements.
        See Also:
        FIRST(Vector, int, HTMLTagCounter), ARGCHECK.index(Vector, int)
        Code:
        Exact Method Body:
         return FIRST(
             html, ARGCHECK.index(html, index),
             new HTMLTagCounter(htmlTags, HTMLTagCounter.NORMAL, HTMLTagCounter.FIRST)
         );
        
      • firstExcept

        public static DotPair firstExcept​
                    (java.util.Vector<? extends HTMLNode> html,
                     int index,
                     java.lang.String... htmlTags)
        
        This will return the first ancestor node, along with its closing element, as a DotPair, subject to the input-parameter 'htmlTags'. In this case, the term 'except' means that any match whose HTML Token is among the list in parameter String... htmlTags will be skipped, and a "higher-level" ancestor will be returned instead.
        Parameters:
        html - This may be any Vectorized-HTML Web-Page (or sub-page).

        The Variable-Type Wild-Card Expression '? extends HTMLNode' means that a Vector<TagNode>, Vector<TextNode> or Vector<CommentNode> will all be accepted by this parameter without causing a compile-time type error.

        These 'sub-type' Vectors are often returned as search results from the classes in the 'NodeSearch' package.
        index - This is the index of the node for whose "ancestors" we are searching (to use a Java-Script DOM Tree term).
        htmlTags - When this list is non-empty (contains at least one token), the search loop will skip over ancestor nodes that are among the members of this var-args parameter list. If this method is invoked and this parameter is an empty list, then the search loop will return the first ancestor node identified.

        FOR INSTANCE: If "B" and "P" were passed as parameters to this method, then the search-loop would continue looking for higher-level ancestors until one is found that is not an HTML 'bold' or 'paragraph' element DotPair.
        Returns:
        This shall return the first sub-list, as a 'DotPair' (start & end index pair). If no matches are found, null is returned. This sublist is nearly identical to the Java-Script DOM Tree concept of ancestor-node, though no trees are constructed by this method.
        Throws:
        java.lang.ArrayIndexOutOfBoundsException - If index is not within the bounds of the passed vectorized-html parameter 'html'
        HTMLTokException - If any of the tags passed are null, or not found in the table of class HTMLTags - specifically if they are not valid HTML Elements.
        See Also:
        FIRST(Vector, int, HTMLTagCounter), ARGCHECK.index(Vector, int)
        Code:
        Exact Method Body:
         return FIRST(
             html, ARGCHECK.index(html, index),
             new HTMLTagCounter(htmlTags, HTMLTagCounter.EXCEPT, HTMLTagCounter.FIRST)
         );
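The difference between `first` and `firstExcept` comes down to how the tag list filters candidate ancestors. The sketch below models only that filtering rule as described in the parameter text. It is a hypothetical illustration and not the library's `HTMLTagCounter`; in particular, the case-insensitive comparison is an assumption, suggested only by the documentation using both "div" and "DIV" in its examples.

```java
import java.util.Arrays;

class TagFilterSketch {
    // Hypothetical model of the two filter modes described above.
    //   except == false : accept only listed tags (an empty list accepts every tag)
    //   except == true  : accept every tag that is NOT listed
    static boolean accepts(String tag, boolean except, String... htmlTags) {
        if (htmlTags.length == 0) return true;   // nothing to require, nothing to skip
        boolean listed = Arrays.stream(htmlTags)
            .anyMatch(t -> t.equalsIgnoreCase(tag));   // case-insensitivity is assumed
        return except != listed;
    }

    public static void main(String[] args) {
        // 'first' style: only div, p and a are eligible, so <b> is skipped
        System.out.println(accepts("b", false, "div", "p", "a"));
        // 'firstExcept' style: <b> and <p> are skipped, <ol> is eligible
        System.out.println(accepts("ol", true, "B", "P"));
    }
}
```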
        
      • all

        public static java.util.Vector<DotPair> all
                    (java.util.Vector<? extends HTMLNode> html,
                     int index,
                     java.lang.String... htmlTags)
        
        This will find all ancestors of a given index. If parameter String... htmlTags is empty (no tags are passed), all HTML elements will be considered. If this parameter contains any elements, then only those elements shall be considered as a match in the ancestor hierarchy.
        Parameters:
        html - This may be any Vectorized-HTML Web-Page (or sub-page).

        The Variable-Type Wild-Card Expression '? extends HTMLNode' means that a Vector<TagNode>, Vector<TextNode> or Vector<CommentNode> will all be accepted by this parameter without causing a compile-time type error.

        These 'sub-type' Vectors are often returned as search results from the classes in the 'NodeSearch' package.
        index - This is the index of the node for whose "ancestors" we are searching (to use a Java-Script DOM Tree term).
        htmlTags - If this list is empty, we shall look for all ancestor nodes. If this list is non-empty, then only ancestor nodes whose HTML Element Tag (usually referred to as "the token") is a member of this varargs String parameter list shall be considered eligible for inclusion in the return result of this method.

        FOR INSTANCE: If "DIV", "P", and "A" were listed - the search loop would skip over all ancestors that were not HTML divider, paragraph or anchor elements before selecting a result.
        Returns:
        This shall return every sub-list, as a 'DotPair' (start & end index pair). If no matches are found, an empty Vector of zero elements is returned. These sublists are nearly identical to the Java-Script DOM Tree concept of ancestor-nodes, though no trees are constructed by this method.
        Throws:
        java.lang.ArrayIndexOutOfBoundsException - If index is not within the bounds of the passed vectorized-html parameter 'html'
        HTMLTokException - If any of the tags passed are null, or not found in the table of class HTMLTags - specifically if they are not valid HTML Elements.
        See Also:
        ALL(Vector, int, HTMLTagCounter), ARGCHECK.index(Vector, int)
        Code:
        Exact Method Body:
         return ALL(
             html, ARGCHECK.index(html, index),
             new HTMLTagCounter(htmlTags, HTMLTagCounter.NORMAL, HTMLTagCounter.ALL)
         );
        
      • allExcept

        public static java.util.Vector<DotPair> allExcept
                    (java.util.Vector<? extends HTMLNode> html,
                     int index,
                     java.lang.String... htmlTags)
        
        This will find all ancestors of a given index. If parameter String... htmlTags is empty (no tags are passed), all HTML elements will be considered. If this parameter contains any elements, then those elements shall not be considered as a match in the ancestor hierarchy.
        Parameters:
        html - This may be any Vectorized-HTML Web-Page (or sub-page).

        The Variable-Type Wild-Card Expression '? extends HTMLNode' means that a Vector<TagNode>, Vector<TextNode> or Vector<CommentNode> will all be accepted by this parameter without causing a compile-time type error.

        These 'sub-type' Vectors are often returned as search results from the classes in the 'NodeSearch' package.
        index - This is the index of the node for whose "ancestors" we are searching (to use a Java-Script DOM Tree term).
        htmlTags - When this list is non-empty (contains at least one token), the search loop will skip over ancestor nodes that are among the members of this var-args parameter list. If this method is invoked and this parameter is an empty list, then the search loop will return all ancestor nodes of the index node.

        FOR INSTANCE: If "B" and "P" were passed as parameters to this method, then the search-loop, which saves all ancestor matches to its result-set, would skip over any HTML 'bold' or 'paragraph' DotPair's.
        Returns:
        This shall return every sub-list, as a 'DotPair' (start & end index pair). If no matches are found, an empty Vector of zero elements is returned. These sublists are nearly identical to the Java-Script DOM Tree concept of ancestor-nodes, though no trees are constructed by this method.
        Throws:
        java.lang.ArrayIndexOutOfBoundsException - If index is not within the bounds of the passed vectorized-html parameter 'html'
        HTMLTokException - If any of the tags passed are null, or not found in the table of class HTMLTags - specifically if they are not valid HTML Elements.
        See Also:
        ALL(Vector, int, HTMLTagCounter), ARGCHECK.index(Vector, int)
        Code:
        Exact Method Body:
         return ALL(
             html, ARGCHECK.index(html, index),
             new HTMLTagCounter(htmlTags, HTMLTagCounter.EXCEPT, HTMLTagCounter.ALL)
         );
        
      • FIRST

        protected static DotPair FIRST​(java.util.Vector<? extends HTMLNode> html,
                                       int index,
                                       Torello.HTML.HTMLTagCounter tagCounter)
        Finds the first ancestor ("surrounding") node pair.
        Parameters:
        html - This may be any Vectorized-HTML Web-Page (or sub-page).

        The Variable-Type Wild-Card Expression '? extends HTMLNode' means that a Vector<TagNode>, Vector<TextNode> or Vector<CommentNode> will all be accepted by this parameter without causing a compile-time type error.

        These 'sub-type' Vectors are often returned as search results from the classes in the 'NodeSearch' package.
        index - This is any index within the bounds of the 'html' parameter.
        tagCounter - An internally used counter, to optimize the search routine.
        Returns:
        The matching ancestor node's start-and-end index as a 'DotPair'.
        See Also:
        TagNode, HTMLNode, DotPair, DotPair.isInside(int), Util.Inclusive.dotPairOPT(Vector, int, int)
        Code:
        Exact Method Body:
         int     size = html.size();
         TagNode tn;
         DotPair ret;
        
         for (   int i=(index-1);
                 (i >= 0) && (! tagCounter.allBanned());
                 i--
         )
        
             if (    ((tn = html.elementAt(i).openTag()) != null)
                 &&  tagCounter.check(tn)
                 &&  ((ret = Util.Inclusive.dotPairOPT(html, i, size)) != null)
                 &&  ret.isInside(index)
                     // isInside(...) Should never fail, but 
             )       // This guarantees to prevent erroneous answers
        
                 // If there is a match, return that match, and exit immediately.
                 return ret;
        
         return null;
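The backward scan in the method body above can be tried out on a simplified model. The sketch below assumes attribute-free string tokens and uses a hypothetical `matchingClose` helper that loosely plays the role of `Util.Inclusive.dotPairOPT`; it leaves out the tag filter entirely. It is an illustration of the scanning idea, not a reimplementation of `FIRST`.

```java
import java.util.List;

class FirstSurroundingSketch {
    static final List<String> SAMPLE =
        List.of("<ol>", "<li>", "<b>", "int", "</b>", "text", "</li>", "</ol>");

    static boolean isOpenTag(String t) {
        return t.startsWith("<") && !t.startsWith("</");
    }

    // Hypothetical helper: index of the closer that matches the opener at
    // openIdx, or -1 when there is none.
    static int matchingClose(List<String> tokens, int openIdx) {
        String open = tokens.get(openIdx);
        String close = "</" + open.substring(1);
        int depth = 0;
        for (int i = openIdx; i < tokens.size(); i++) {
            if (tokens.get(i).equals(open)) depth++;
            else if (tokens.get(i).equals(close) && --depth == 0) return i;
        }
        return -1;
    }

    // Walks backward from index; the first opener whose closer lies beyond
    // index is the innermost surrounding element. Returns {start, end} or null.
    static int[] firstSurrounding(List<String> tokens, int index) {
        for (int i = index - 1; i >= 0; i--) {
            if (!isOpenTag(tokens.get(i))) continue;
            int end = matchingClose(tokens, i);
            if (end > index) return new int[] { i, end };
        }
        return null;
    }

    public static void main(String[] args) {
        int[] pair = firstSurrounding(SAMPLE, 5);        // index 5 is "text"
        System.out.println(pair[0] + ", " + pair[1]);    // prints "1, 6"
    }
}
```

The `<b>` at index 2 is an opener that precedes "text", but its closer at index 4 comes before index 5, so it is a sibling rather than an ancestor and the scan moves on to `<li>`.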
        
      • ALL

        protected static java.util.Vector<DotPair> ALL
                    (java.util.Vector<? extends HTMLNode> html,
                     int index,
                     Torello.HTML.HTMLTagCounter tagCounter)
        
        Finds all ancestor ("surrounding") node pairs.
        Parameters:
        html - This may be any Vectorized-HTML Web-Page (or sub-page).

        The Variable-Type Wild-Card Expression '? extends HTMLNode' means that a Vector<TagNode>, Vector<TextNode> or Vector<CommentNode> will all be accepted by this parameter without causing a compile-time type error.

        These 'sub-type' Vectors are often returned as search results from the classes in the 'NodeSearch' package.
        index - This is any index within the bounds of the 'html' parameter.
        tagCounter - An internally used counter, to optimize the search routine.
        Returns:
        All matching ancestor nodes' start-and-end index pairs inside a Vector<DotPair>
        See Also:
        TagNode, HTMLNode, DotPair, DotPair.isInside(int), Util.Inclusive.dotPairOPT(Vector, int, int)
        Code:
        Exact Method Body:
         HTMLNode n;     TagNode tn;     DotPair dp;     int size = html.size();
         Vector<DotPair> ret = new Vector<>();
        
         for (int i=(index-1); (i >= 0) && (! tagCounter.allBanned()); i--)
        
             if (    (n = html.elementAt(i)).isTagNode()
                 &&  tagCounter.check(tn = (TagNode) n)
             )
             {
                 if (    ((dp = Util.Inclusive.dotPairOPT(html, i, size)) != null)
                     &&  dp.isInside(index)
                 )           // isInside(...) Should never fail, but 
                             // This guarantees to prevent erroneous answers
                     ret.addElement(dp);
        
                 else
                     // If finding a token match fails, just ignore that token from now on...
                     tagCounter.reportFailed(tn.tok);
        
             }
        
         return ret;
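For comparison, all enclosing elements can also be collected in a single backward pass by keeping a per-tag count of closers already seen. This is a different technique from the method body above, and it makes no claim about how `HTMLTagCounter` works internally. It assumes well-formed, attribute-free string tokens with no void elements such as `<br>`, and it returns only the opener indices, innermost first.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class AllSurroundingSketch {
    static final List<String> SAMPLE =
        List.of("<ol>", "<li>", "<b>", "int", "</b>", "text", "</li>", "</ol>");

    // Returns the index of every opening tag still unclosed at 'index',
    // innermost first. Hypothetical illustration, not library code.
    static List<Integer> enclosingOpeners(List<String> tokens, int index) {
        Map<String, Integer> pendingClosers = new HashMap<>(); // tag name -> closers seen
        List<Integer> result = new ArrayList<>();
        for (int i = index - 1; i >= 0; i--) {
            String t = tokens.get(i);
            if (t.startsWith("</")) {
                // A closer met while walking backward belongs to a sibling element.
                pendingClosers.merge(t.substring(2, t.length() - 1), 1, Integer::sum);
            } else if (t.startsWith("<")) {
                String name = t.substring(1, t.length() - 1);
                int pending = pendingClosers.getOrDefault(name, 0);
                if (pending > 0) pendingClosers.put(name, pending - 1); // pairs with that closer
                else result.add(i);                                     // unclosed: an ancestor
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(enclosingOpeners(SAMPLE, 5)); // prints "[1, 0]"
    }
}
```

The innermost-first ordering mirrors the sample output earlier on this page, where the `<li>` pair is listed before the `<ol>` pair.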