Package Torello.HTML
Class Surrounding
- java.lang.Object
  - Torello.HTML.Surrounding
public class Surrounding extends java.lang.Object
Class for finding ancestor & parent nodes of any selected HTMLNode.
Substitute for the DOM-Tree concepts of 'parent' and 'ancestor'
Class 'Surrounding' is intended to function in place of the Java-Script DOM Tree (Document Object Model Tree) concept known as "Parent" or "Ancestor". Thinking about documents as trees can make some parts of Java-Script a little easier to program. For any content-based page, however, it is much more consistent with the intentions of the author or writer of the page to think of HTML as a list (of TextNode's, TagNode's, etc.). The name "HTML" means Hyper-Text Markup Language, meaning that text documents are just "marked up" by HTML Elements, so using Java Vector's (instead of DOM Trees) is the guiding philosophy.
There are other HTML Parsers which build DOM Trees, and generally those parsers are quick to modify the HTML being parsed if any unmatched closing-tags ("Elements") are found. Here, instead, the philosophy is that the HTML is presumed valid; if unmatched closing HTML Tags are present, an 'Inclusive' search would simply not produce the expected result. Since the vast majority of uses for this package would be scraping news & informational sites, all of which have automatically generated HTML, worrying about unclosed HTML is best left to the "Browser Pioneers" who write the rendering functions for web-pages, and the concept is mostly ignored here.
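The flat-list view described above can be illustrated without any library. The following is a minimal sketch, assuming a hypothetical, simplified token model in which each node is a plain String held in a java.util.List (rather than an HTMLNode held in a Vector). The names FlatListSketch, openName and findClosing are illustrative only and are not part of this package. The sketch locates the closing tag that matches an opening tag by counting nesting depth, and shows that an element which is never closed simply yields no result.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model: tokens are plain Strings, not HTMLNode instances.
class FlatListSketch
{
    // Returns the lower-case element name of an opening tag ("<li>" yields "li").
    // Closing tags and text tokens yield null.
    static String openName(String tok)
    {
        if (!tok.startsWith("<") || tok.startsWith("</") || !tok.endsWith(">")) return null;
        String inner = tok.substring(1, tok.length() - 1).trim();
        int sp = inner.indexOf(' ');
        return ((sp < 0) ? inner : inner.substring(0, sp)).toLowerCase();
    }

    // Returns the index of the closing tag that matches the opening tag at 'open',
    // or -1 if that token is not an opening tag, or if it is never closed.
    static int findClosing(List<String> page, int open)
    {
        String name = openName(page.get(open));
        if (name == null) return -1;

        String closer = "</" + name + ">";
        int depth = 0;

        for (int i = open; i < page.size(); i++)
        {
            String tok = page.get(i);
            if (name.equals(openName(tok)))         depth++;    // same element, nested deeper
            else if (tok.equalsIgnoreCase(closer))  depth--;    // one level closed
            if (depth == 0) return i;
        }
        return -1;  // unclosed element: no pair can be reported
    }

    public static void main(String[] args)
    {
        List<String> page = Arrays.asList(
            "<ol>", "<li>", "<b>", "int", "</b>", " returns a count", "</li>", "</ol>"
        );
        System.out.println(findClosing(page, 1));   // prints 6, the index of </li>
    }
}
```

In this model, a start index together with the index returned by findClosing plays the role that a start & end index pair plays in the descriptions on this page.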
The following example demonstrates how to find the parent and ancestor nodes at a particular index. This example parses one of the documentation pages found in the JavaDocs for this package. It then picks a particular TextNode instance, and asks for all of the HTML Elements whose opening and closing tags "enclose" that TextNode.
Example:
// Load the documentation html page into vectorized-html
StringBuffer sb = new StringBuffer();
URL url = new URL("http://developer.torello.directory/JavaHTML/Version%201/1.4/javadoc/Torello/HTML/NodeSearch/CommentNodeCount.html");
Vector<HTMLNode> page = HTMLPage.getPageTokens(url, false);

// Obtain a vector-index pointer to the text-node containing the indicated string:
// "a count of how many"
// This is a line of text from the JavaDoc HTML Page that was loaded above.
int pos = TextNodeFind.first(page, TextComparitor.CN_CI, "a count of how many");

// Print the output found above to a StringBuffer
sb.append("Text Node Found: [" + page.elementAt(pos) + "]\n");

// Find the first "ancestor node" or "parent node" of this TextNode
// Restrict the search to leave out: <LI>, <BODY> or <DIV>
DotPair dp = Surrounding.firstExcept(page, pos, "li", "body", "div");

// Print the output of this search to the StringBuffer / Log
sb.append("Index Found: " + pos + ", DotPair Found: " + dp.toString() + "\n");
sb.append(Debug.printJ(page, dp) + "\n");

// Now print all "ancestor nodes" (Surrounding nodes) - leave out <BODY>, <HTML> and <DIV>
// ancestors
Vector<DotPair> allDP = Surrounding.allExcept(page, pos, "body", "html", "div");

for (DotPair l : allDP) sb.append(
    C.BCYAN +
    "************************************************************\n" +
    "************************************************************\n" +
    C.RESET +
    "Index Found: " + pos + ", DotPair Found: " + l.toString() + "\n" +
    "Starting Node: " + C.BRED + page.elementAt(l.start).str + C.RESET + "\n" +
    "Ending Node:" + C.BRED + page.elementAt(l.end).str + C.RESET + "\n"
);

// Print the StringBuffer / Log to Standard-Out, and to the text-file "out.html"
String s = sb.toString();
System.out.println(s);

// NOTE: The above "Printing" uses the Shell.C class (which are UNIX Color-Codes)
// This converts those color-codes to HTML <SPAN>...</SPAN> Elements
FileRW.writeFile(C.toHTML(s.replace("<", "&lt;").replace(">", "&gt;")), "out.html");
The above example would print these results to a UNIX terminal:
Text Node Found: [ returns a count of how many TextNode's were identified on the vectorized-page parameter
]
Index Found: 698, DotPair Found: [690, 705]
[<ol>][
][<li>][<b>][<code>][int][</code>][</b>][ returns a count of how many TextNode's were identified on the vectorized-page parameter
][<code>]['html'][</code>][ that contained text that matched the specified criteria][</li>][
][</ol>]
************************************************************
************************************************************
Index Found: 698, DotPair Found: [692, 703]
Starting Node: <li>
Ending Node:</li>
************************************************************
************************************************************
Index Found: 698, DotPair Found: [690, 705]
Starting Node: <ol>
Ending Node:</ol>
************************************************************
************************************************************
Index Found: 698, DotPair Found: [284, 817]
Starting Node: <ul class="blockList">
Ending Node:</ul>
Hi-Lited Source-Code:
This File's Source Code: Torello/HTML/Surrounding.java
File Size: 15,535 Bytes, Line Count: 332 '\n' Characters Found
Surrounding Helper Class: HTML Processors/Surrounding/HTMLTagCounter.java
File Size: 2,336 Bytes, Line Count: 75 '\n' Characters Found
Stateless Class: This class neither contains any program-state, nor can it be instantiated. The @StaticFunctional Annotation may also be called 'The Spaghetti Report'. Static-Functional classes are, essentially, C-Styled Files, without any constructors or non-static member fields. It is a concept very similar to the Java-Bean's @Stateless Annotation.
- 1 Constructor(s), 1 declared private, zero-argument constructor
- 7 Method(s), 7 declared static
- 0 Field(s)
Method Summary
First Ancestor
Modifier and Type                         Method
static DotPair                            first(Vector<? extends HTMLNode> html, int index, String... htmlTags)
static DotPair                            firstExcept(Vector<? extends HTMLNode> html, int index, String... htmlTags)
All Ancestors
Modifier and Type                         Method
static Vector<DotPair>                    all(Vector<? extends HTMLNode> html, int index, String... htmlTags)
static Vector<DotPair>                    allExcept(Vector<? extends HTMLNode> html, int index, String... htmlTags)
Protected, Internal Methods
Modifier and Type                         Method
protected static Vector<DotPair>          ALL(Vector<? extends HTMLNode> html, int index, Torello.HTML.HTMLTagCounter tagCounter)
protected static DotPair                  FIRST(Vector<? extends HTMLNode> html, int index, Torello.HTML.HTMLTagCounter tagCounter)
Method Detail
first
public static DotPair first(java.util.Vector<? extends HTMLNode> html, int index, java.lang.String... htmlTags)
This will return the first matching ancestor node, along with its closing element, as a DotPair.
Parameters:
html - This may be any Vectorized-HTML Web-Page (or sub-page). The Variable-Type Wild-Card Expression '? extends HTMLNode' means that a Vector<TagNode>, Vector<TextNode> or Vector<CommentNode> will all be accepted by this parameter without causing an exception throw. These 'sub-type' Vectors are often returned as search results from the classes in the 'NodeSearch' package.
index - This is the index of the node for whose "ancestors" we are searching (to use a Java-Script DOM Tree term).
htmlTags - If this list is empty, we shall look for any ancestor node. Since this method returns the first match, if this list is left empty and the index-node is surrounded by even a bold "<B>...</B>", then that will be the DotPair result that is returned. If this list is non-empty, then only ancestor nodes whose HTML Element Tag (usually referred to as "the Element") matches a tag from this list shall be returned.
FOR INSTANCE: If "div", "p", and "a" were provided as values to this parameter, the search loop would skip over all ancestors that were not HTML divider, paragraph or anchor elements before selecting a result.
Returns:
This shall return the first sub-list, as a 'DotPair' (start & end index pair). If no matches are found, null will be returned. This sublist is nearly identical to the Java-Script DOM Tree concept of ancestor-node, though no trees are constructed by this method.
Throws:
java.lang.ArrayIndexOutOfBoundsException - If index is not within the bounds of the passed vectorized-html parameter 'html'
HTMLTokException - If any of the tags passed are null, or not found in the table of class HTMLTags; specifically, if they are not valid HTML Elements.
See Also:
FIRST(Vector, int, HTMLTagCounter), ARGCHECK.index(Vector, int)
Code:
Exact Method Body:
return FIRST(
    html, ARGCHECK.index(html, index),
    new HTMLTagCounter(htmlTags, HTMLTagCounter.NORMAL, HTMLTagCounter.FIRST)
);
firstExcept
public static DotPair firstExcept(java.util.Vector<? extends HTMLNode> html, int index, java.lang.String... htmlTags)
This will return the first ancestor node, along with its closing element, as a DotPair, subject to the input-parameter 'htmlTags'. In this case, the term 'except' shall mean that any ancestor whose HTML Token is among the list in parameter String... htmlTags will be skipped, and a "higher-level" ancestor will be returned instead.
Parameters:
html - This may be any Vectorized-HTML Web-Page (or sub-page). The Variable-Type Wild-Card Expression '? extends HTMLNode' means that a Vector<TagNode>, Vector<TextNode> or Vector<CommentNode> will all be accepted by this parameter without causing an exception throw. These 'sub-type' Vectors are often returned as search results from the classes in the 'NodeSearch' package.
index - This is the index of the node for whose "ancestors" we are searching (to use a Java-Script DOM Tree term).
htmlTags - When this list is non-empty (contains at least one token), the search loop will skip over ancestor nodes that are among the members of this var-args parameter list. If this method is invoked and this parameter is an empty list, then the search loop will return the first ancestor node identified.
FOR INSTANCE: If "B" and "P" were passed as parameters to this method, then the search-loop would continue looking for higher-level ancestors until one was found that was not an HTML 'bold' or 'paragraph' element DotPair.
Returns:
This shall return the first sub-list, as a 'DotPair' (start & end index pair). If no matches are found, null will be returned. This sublist is nearly identical to the Java-Script DOM Tree concept of ancestor-node, though no trees are constructed by this method.
Throws:
java.lang.ArrayIndexOutOfBoundsException - If index is not within the bounds of the passed vectorized-html parameter 'html'
HTMLTokException - If any of the tags passed are null, or not found in the table of class HTMLTags; specifically, if they are not valid HTML Elements.
See Also:
FIRST(Vector, int, HTMLTagCounter), ARGCHECK.index(Vector, int)
Code:
Exact Method Body:
return FIRST(
    html, ARGCHECK.index(html, index),
    new HTMLTagCounter(htmlTags, HTMLTagCounter.EXCEPT, HTMLTagCounter.FIRST)
);
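The difference between the plain and the 'Except' variants comes down to how a candidate ancestor's tag is tested against the htmlTags list. The following is a minimal sketch of that decision, assuming a hypothetical stand-in class named TagFilterSketch; it is not the package's HTMLTagCounter, whose internals are not shown on this page. Case-insensitive comparison of tag names is also an assumption made for this sketch.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the tag-eligibility decision; NOT the package's HTMLTagCounter.
class TagFilterSketch
{
    enum Mode { NORMAL, EXCEPT }

    private final Set<String> tags = new HashSet<>();
    private final Mode mode;

    TagFilterSketch(Mode mode, String... htmlTags)
    {
        this.mode = mode;
        // Assumption for this sketch: tag names are compared case-insensitively.
        for (String t : htmlTags) tags.add(t.toLowerCase());
    }

    // NORMAL: an empty list accepts every tag, otherwise only listed tags are eligible.
    // EXCEPT: listed tags are skipped, every other tag is eligible.
    boolean accepts(String tag)
    {
        String t = tag.toLowerCase();
        if (mode == Mode.NORMAL) return tags.isEmpty() || tags.contains(t);
        return !tags.contains(t);
    }

    public static void main(String[] args)
    {
        TagFilterSketch except = new TagFilterSketch(Mode.EXCEPT, "li", "body", "div");
        System.out.println(except.accepts("li"));   // prints false: this ancestor is skipped
        System.out.println(except.accepts("ol"));   // prints true: this ancestor is eligible
    }
}
```

Under this sketch, an empty list behaves the same in both modes: every ancestor is eligible.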
all
public static java.util.Vector<DotPair> all(java.util.Vector<? extends HTMLNode> html, int index, java.lang.String... htmlTags)
This will find all ancestors of a given index. If parameter String... htmlTags is empty, all HTML elements will be considered. If this parameter contains any elements, then only those elements shall be considered as a match in the ancestor hierarchy.
Parameters:
html - This may be any Vectorized-HTML Web-Page (or sub-page). The Variable-Type Wild-Card Expression '? extends HTMLNode' means that a Vector<TagNode>, Vector<TextNode> or Vector<CommentNode> will all be accepted by this parameter without causing an exception throw. These 'sub-type' Vectors are often returned as search results from the classes in the 'NodeSearch' package.
index - This is the index of the node for whose "ancestors" we are searching (to use a Java-Script DOM Tree term).
htmlTags - If this list is empty, we shall look for all ancestor nodes. If this list is non-empty, then only ancestor nodes whose HTML Element Tag (usually referred to as "the token") is a member of this varargs String parameter list shall be considered eligible as a return result for this method.
FOR INSTANCE: If "DIV", "P", and "A" were listed, the search loop would skip over all ancestors that were not HTML divider, paragraph or anchor elements when collecting results.
Returns:
This shall return every sub-list, each as a 'DotPair' (start & end index pair). If no matches are found, an empty Vector of zero elements shall be returned. These sublists are nearly identical to the Java-Script DOM Tree concept of ancestor-nodes, though no trees are constructed by this method.
Throws:
java.lang.ArrayIndexOutOfBoundsException - If index is not within the bounds of the passed vectorized-html parameter 'html'
HTMLTokException - If any of the tags passed are null, or not found in the table of class HTMLTags; specifically, if they are not valid HTML Elements.
See Also:
ALL(Vector, int, HTMLTagCounter), ARGCHECK.index(Vector, int)
Code:
Exact Method Body:
return ALL(
    html, ARGCHECK.index(html, index),
    new HTMLTagCounter(htmlTags, HTMLTagCounter.NORMAL, HTMLTagCounter.ALL)
);
allExcept
public static java.util.Vector<DotPair> allExcept(java.util.Vector<? extends HTMLNode> html, int index, java.lang.String... htmlTags)
This will find all ancestors of a given index. If parameter String... htmlTags is empty, all HTML elements will be considered. If this parameter contains any elements, then those elements shall not be considered as a match in the ancestor hierarchy.
Parameters:
html - This may be any Vectorized-HTML Web-Page (or sub-page). The Variable-Type Wild-Card Expression '? extends HTMLNode' means that a Vector<TagNode>, Vector<TextNode> or Vector<CommentNode> will all be accepted by this parameter without causing an exception throw. These 'sub-type' Vectors are often returned as search results from the classes in the 'NodeSearch' package.
index - This is the index of the node for whose "ancestors" we are searching (to use a Java-Script DOM Tree term).
htmlTags - When this list is non-empty (contains at least one token), the search loop will skip over ancestor nodes that are among the members of this var-args parameter list. If this method is invoked and this parameter is an empty list, then the search loop will return all ancestor nodes of the index node.
FOR INSTANCE: If "B" and "P" were passed as parameters to this method, then the search-loop, which saves all ancestor matches to its result-set, would skip over any HTML 'bold' or 'paragraph' DotPair's.
Returns:
This shall return every sub-list, each as a 'DotPair' (start & end index pair). If no matches are found, an empty Vector of zero elements shall be returned. These sublists are nearly identical to the Java-Script DOM Tree concept of ancestor-nodes, though no trees are constructed by this method.
Throws:
java.lang.ArrayIndexOutOfBoundsException - If index is not within the bounds of the passed vectorized-html parameter 'html'
HTMLTokException - If any of the tags passed are null, or not found in the table of class HTMLTags; specifically, if they are not valid HTML Elements.
See Also:
ALL(Vector, int, HTMLTagCounter), ARGCHECK.index(Vector, int)
Code:
Exact Method Body:
return ALL(
    html, ARGCHECK.index(html, index),
    new HTMLTagCounter(htmlTags, HTMLTagCounter.EXCEPT, HTMLTagCounter.ALL)
);
FIRST
protected static DotPair FIRST(java.util.Vector<? extends HTMLNode> html, int index, Torello.HTML.HTMLTagCounter tagCounter)
Finds the first ancestor ("surrounding") node pair.
Parameters:
html - This may be any Vectorized-HTML Web-Page (or sub-page). The Variable-Type Wild-Card Expression '? extends HTMLNode' means that a Vector<TagNode>, Vector<TextNode> or Vector<CommentNode> will all be accepted by this parameter without causing an exception throw. These 'sub-type' Vectors are often returned as search results from the classes in the 'NodeSearch' package.
index - This is any index within the bounds of the 'html' parameter.
tagCounter - An internally used counter, to optimize the search routine.
Returns:
The matching ancestor node's start-and-end index as a 'DotPair', or null if no match is found.
See Also:
TagNode, HTMLNode, DotPair, DotPair.isInside(int), Util.Inclusive.dotPairOPT(Vector, int, int)
Code:
Exact Method Body:
int     size = html.size();
TagNode tn;
DotPair ret;

for (int i=(index-1); (i >= 0) && (! tagCounter.allBanned()); i--)

    if (    ((tn = html.elementAt(i).openTag()) != null)
        &&  tagCounter.check(tn)
        &&  ((ret = Util.Inclusive.dotPairOPT(html, i, size)) != null)
        &&  ret.isInside(index)
                // isInside(...) Should never fail, but
                // This guarantees to prevent erroneous answers
    )

        // If there is a match, return that match, and exit immediately.
        return ret;

return null;
ALL
protected static java.util.Vector<DotPair> ALL(java.util.Vector<? extends HTMLNode> html, int index, Torello.HTML.HTMLTagCounter tagCounter)
Finds all ancestor ("surrounding") node pairs.
Parameters:
html - This may be any Vectorized-HTML Web-Page (or sub-page). The Variable-Type Wild-Card Expression '? extends HTMLNode' means that a Vector<TagNode>, Vector<TextNode> or Vector<CommentNode> will all be accepted by this parameter without causing an exception throw. These 'sub-type' Vectors are often returned as search results from the classes in the 'NodeSearch' package.
index - This is any index within the bounds of the 'html' parameter.
tagCounter - An internally used counter, to optimize the search routine.
Returns:
All matching ancestor nodes' start-and-end index pairs inside a Vector<DotPair>
See Also:
TagNode, HTMLNode, DotPair, DotPair.isInside(int), Util.Inclusive.dotPairOPT(Vector, int, int)
Code:
Exact Method Body:
HTMLNode        n;
TagNode         tn;
DotPair         dp;
int             size = html.size();
Vector<DotPair> ret  = new Vector<>();

for (int i=(index-1); (i >= 0) && (! tagCounter.allBanned()); i--)

    if (    (n = html.elementAt(i)).isTagNode()
        &&  tagCounter.check(tn = (TagNode) n)
    )
    {
        if (    ((dp = Util.Inclusive.dotPairOPT(html, i, size)) != null)
            &&  dp.isInside(index)
                    // isInside(...) Should never fail, but
                    // This guarantees to prevent erroneous answers
        )
            ret.addElement(dp);

        else
            // If finding a token match fails, just ignore that token from now on...
            tagCounter.reportFailed(tn.tok);
    }

return ret;
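The backward scan for enclosing pairs can also be sketched in a self-contained form. The following is only an illustration under stated assumptions: tokens are plain Strings in a java.util.List, pairs are int arrays of the form {start, end}, and the class AllAncestorsSketch with its helper methods is hypothetical. In particular, the per-tag count of closing tags used here to skip sibling elements is an assumption about how such a scan can be made to work; it is not a description of HTMLTagCounter, and the tag-banning optimization of the method body above is omitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model: tokens are plain Strings; pairs are int[]{start, end}.
class AllAncestorsSketch
{
    // "<ul class='x'>" yields "ul"; closing tags and text tokens yield null.
    static String openName(String tok)
    {
        if (!tok.startsWith("<") || tok.startsWith("</") || !tok.endsWith(">")) return null;
        String inner = tok.substring(1, tok.length() - 1).trim();
        int sp = inner.indexOf(' ');
        return ((sp < 0) ? inner : inner.substring(0, sp)).toLowerCase();
    }

    // "</ul>" yields "ul"; everything else yields null.
    static String closeName(String tok)
    {
        if (!tok.startsWith("</") || !tok.endsWith(">")) return null;
        return tok.substring(2, tok.length() - 1).trim().toLowerCase();
    }

    // Forward, depth-counting search for the closer of the opener at 'open'; -1 if none.
    static int findClosing(List<String> page, int open, String name)
    {
        int depth = 0;
        for (int i = open; i < page.size(); i++)
        {
            String tok = page.get(i);
            if (name.equals(openName(tok)))         depth++;
            else if (name.equals(closeName(tok)))   depth--;
            if (depth == 0) return i;
        }
        return -1;
    }

    // Collects every {start, end} pair that encloses 'index', innermost pair first.
    static List<int[]> allAncestors(List<String> page, int index)
    {
        List<int[]> ret = new ArrayList<>();
        Map<String, Integer> pendingClosers = new HashMap<>();

        for (int i = index - 1; i >= 0; i--)
        {
            String tok = page.get(i);

            // A closing tag seen while walking backward belongs to an earlier sibling.
            String close = closeName(tok);
            if (close != null) { pendingClosers.merge(close, 1, Integer::sum); continue; }

            String open = openName(tok);
            if (open == null) continue;     // a text token

            // This opener pairs with a closer already seen, so it is a sibling, not an ancestor.
            if (pendingClosers.getOrDefault(open, 0) > 0)
            { pendingClosers.merge(open, -1, Integer::sum); continue; }

            // Candidate ancestor: keep it only if its closer lies beyond 'index'.
            int end = findClosing(page, i, open);
            if (end > index) ret.add(new int[] { i, end });
        }
        return ret;
    }

    public static void main(String[] args)
    {
        List<String> page = Arrays.asList(
            "<ul class=\"blockList\">", "<li>", "<b>", "x", "</b>", "</li>",
            "<ol>", "<li>", "<b>", "int", "</b>", " returns a count", "</li>", "</ol>", "</ul>"
        );
        for (int[] p : allAncestors(page, 11))
            System.out.println("[" + p[0] + ", " + p[1] + "] " + page.get(p[0]));
    }
}
```

Because the scan walks backward from the index, the first pair collected is the innermost one, which corresponds to what a 'first' style search would return in this model.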