Package Torello.HTML.NodeSearch

The purpose of these classes is to allow a programmer to "search" through web-pages that have been downloaded and vectorized into a Java Vector<HTMLNode>.

The following key words are important to understand when deciding on an appropriate search class and search method:

  • InnerTag: This word means that the attributes inside an HTML TagNode element are used to search for and identify TagNode matches.
  • TagNode: This implies that only the HTML TagNode element itself (its TagNode.tok field) will be used to specify search criteria. InnerTags (a.k.a. 'attributes') will not be used to specify the search.
  • TextNode: Use of this word (in a class name) means that TagNode elements will be ignored completely, and instead the "text" inside an HTML page or sub-page is searched by means of 'TextNode' elements.
  • CommentNode: Use of this word (in a class name) means that the search-specifier will ignore all TagNode and TextNode elements, and instead focus on the contents of the HTML CommentNodes within an HTML page or sub-page.
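
The four kinds of search criteria listed above can be illustrated with a small, self-contained sketch. The node classes below (Node, Tag, Text, Comment) and the matcher methods are hypothetical stand-ins written only for this illustration; they are not the package's actual classes or method signatures. The sketch also assumes that 'tok' holds the element name.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Vector;

// Hypothetical stand-ins for illustration only; NOT the real NodeSearch classes.
class CriteriaSketch {
    static abstract class Node { }

    static final class Tag extends Node {
        final String tok;                  // assumed: the element name, e.g. "div"
        final Map<String, String> attrs;   // attribute ("inner-tag") key-value pairs
        Tag(String tok, Map<String, String> attrs) { this.tok = tok; this.attrs = attrs; }
    }

    static final class Text extends Node {
        final String str;
        Text(String str) { this.str = str; }
    }

    static final class Comment extends Node {
        final String body;
        Comment(String body) { this.body = body; }
    }

    // 'TagNode' criterion: only the element token is examined, attributes are ignored.
    static boolean byTag(Node n, String tok) {
        return (n instanceof Tag) && ((Tag) n).tok.equalsIgnoreCase(tok);
    }

    // 'InnerTag' criterion: an attribute value identifies the match.
    static boolean byInnerTag(Node n, String key, String value) {
        return (n instanceof Tag) && value.equals(((Tag) n).attrs.get(key));
    }

    // 'TextNode' criterion: tags and comments never match.
    static boolean byText(Node n, String needle) {
        return (n instanceof Text) && ((Text) n).str.contains(needle);
    }

    // 'CommentNode' criterion: tags and text never match.
    static boolean byComment(Node n, String needle) {
        return (n instanceof Comment) && ((Comment) n).body.contains(needle);
    }

    // A tiny "vectorized page": <div class="menu">, text, a comment, <p>.
    static Vector<Node> samplePage() {
        Vector<Node> page = new Vector<>();
        page.add(new Tag("div", Collections.singletonMap("class", "menu")));
        page.add(new Text("Hello"));
        page.add(new Comment(" generated "));
        page.add(new Tag("p", Collections.<String, String>emptyMap()));
        return page;
    }

    public static void main(String[] args) {
        for (Node n : samplePage())
            System.out.println(n.getClass().getSimpleName()
                + " tag=" + byTag(n, "div")
                + " innerTag=" + byInnerTag(n, "class", "menu")
                + " text=" + byText(n, "Hell")
                + " comment=" + byComment(n, "generated"));
    }
}
```

Each matcher looks at exactly one kind of node, which mirrors how the key word in a class name decides what the search examines.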

The following key words are also important, and explain some nuances of the HTML search methods:

  • Count: This implies that a count of the number of nodes that match the specified search criteria shall be computed. Methods in 'Count' classes always return a simple integer that represents this count.
  • Find: This implies that integer-arrays or simple integers are returned by the methods in any class with the word 'Find' in its name. These integers are intended to function as pointers (indices) into the underlying Java Vector<HTMLNode>.
  • Get: This implies that the HTMLNodes themselves (TagNode, TextNode, etc.) are returned by the methods in any of these classes. Integer-pointers (a.k.a. the integer-index into the underlying Vector<HTMLNode>) are not returned.
  • Peek: This implies that BOTH the Vector-index AND the HTMLNode found at that index are returned together by the methods in a class having the word 'Peek' in its name. It is here that the simple data-classes 'TagNodeIndex', 'TextNodeIndex', etc. are used; they serve as the return values of the 'Peek' methods.
  • Poll: This refers to the operation of BOTH removing a node from the vectorized-html web-page AND returning the node (or nodes) that were removed to the programmer as the return value. Remember, for all methods in classes that have the word 'Poll' in their name, the Vector<HTMLNode> will contain fewer elements after the method has finished.
  • Remove: This implies that neither nodes nor node-pointers are returned; the nodes are simply removed from the page. An integer stating exactly how many nodes were removed is returned to the caller. Remember, after a 'Remove' operation, the original vectorized-html will contain fewer elements.
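
The difference between these six key words is mostly a difference in return value and side effect, which the following minimal sketch tries to show. It models a page as a Vector<String> and takes the search criterion as a java.util.function.Predicate; the class VerbSketch, its method names, and the 'Indexed' pair class are hypothetical and written only for this illustration. They are not the package's actual signatures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Vector;
import java.util.function.Predicate;

// Hypothetical model for illustration only; NOT the real NodeSearch API.
class VerbSketch {
    // Stand-in for the 'Peek' return type: an index paired with the node found there.
    static final class Indexed {
        final int index;
        final String node;
        Indexed(int index, String node) { this.index = index; this.node = node; }
    }

    static Vector<String> samplePage() {
        return new Vector<>(Arrays.asList("<p>", "a", "</p>", "<p>", "b", "</p>"));
    }

    // Count: how many nodes match. The page is not modified.
    static int count(Vector<String> page, Predicate<String> p) {
        int c = 0;
        for (String n : page) if (p.test(n)) c++;
        return c;
    }

    // Find: the indices of the matches. The page is not modified.
    static int[] find(Vector<String> page, Predicate<String> p) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < page.size(); i++) if (p.test(page.get(i))) hits.add(i);
        int[] ret = new int[hits.size()];
        for (int i = 0; i < ret.length; i++) ret[i] = hits.get(i);
        return ret;
    }

    // Get: the matching nodes themselves, without their indices.
    static Vector<String> get(Vector<String> page, Predicate<String> p) {
        Vector<String> ret = new Vector<>();
        for (String n : page) if (p.test(n)) ret.add(n);
        return ret;
    }

    // Peek: index and node together.
    static Vector<Indexed> peek(Vector<String> page, Predicate<String> p) {
        Vector<Indexed> ret = new Vector<>();
        for (int i = 0; i < page.size(); i++)
            if (p.test(page.get(i))) ret.add(new Indexed(i, page.get(i)));
        return ret;
    }

    // Poll: removes the matches from the page AND returns them.
    static Vector<String> poll(Vector<String> page, Predicate<String> p) {
        Vector<String> removed = new Vector<>();
        for (Iterator<String> it = page.iterator(); it.hasNext(); ) {
            String n = it.next();
            if (p.test(n)) { removed.add(n); it.remove(); }
        }
        return removed;
    }

    // Remove: removes the matches and returns only how many were removed.
    static int remove(Vector<String> page, Predicate<String> p) {
        int before = page.size();
        page.removeIf(p);
        return before - page.size();
    }

    public static void main(String[] args) {
        Predicate<String> isOpenP = n -> n.equals("<p>");
        Vector<String> page = samplePage();
        System.out.println("count = " + count(page, isOpenP));
        System.out.println("find  = " + Arrays.toString(find(page, isOpenP)));
        System.out.println("poll  = " + poll(page, isOpenP) + ", page now " + page);
    }
}
```

Only the 'Poll' and 'Remove' methods in this sketch shrink the Vector; 'Count', 'Find', 'Get', and 'Peek' leave it untouched.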

Inclusive: Similar to JavaScript '.innerHTML'

The key-word "inclusive" should probably be explained here. For the most part, "inclusive" is quite similar to the JavaScript concept of '.innerHTML'. This property is available on most of the nodes within a JavaScript DOM tree. It is used to retrieve every node between an opening element ('<DIV ..>' for example) and its corresponding closing element ('</DIV>').

When a TagNode is found using either an 'InnerTag' search (attribute key-value pair) or a simple TagNode search, the 'inclusive' methods return the opening tag, the closing tag, and every HTMLNode between the two.
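
As a rough illustration of an 'inclusive' result, the sketch below models a page as a list of strings, locates the closing tag that balances a given opening tag, and returns the whole range. The class InclusiveSketch and its methods are hypothetical, and balancing nested elements of the same name by depth counting is an assumption of this sketch, not a statement about how the package itself does it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Vector;

// Hypothetical illustration only; NOT the real NodeSearch implementation.
class InclusiveSketch {
    // True for "<tok>" or "<tok ...>", false for "</tok>" and for other elements.
    static boolean isOpening(String node, String tok) {
        String s = node.toLowerCase(Locale.ROOT);
        String prefix = "<" + tok.toLowerCase(Locale.ROOT);
        if (!s.startsWith(prefix) || s.length() == prefix.length()) return false;
        char next = s.charAt(prefix.length());
        return next == '>' || Character.isWhitespace(next);
    }

    static boolean isClosing(String node, String tok) {
        return node.equalsIgnoreCase("</" + tok + ">");
    }

    // Index of the closing tag that balances the opening tag at 'open', or -1 if none.
    static int findClosing(List<String> page, int open, String tok) {
        if (!isOpening(page.get(open), tok)) return -1;
        int depth = 0;
        for (int i = open; i < page.size(); i++) {
            String n = page.get(i);
            if (isOpening(n, tok)) depth++;                  // nested element of the same name
            else if (isClosing(n, tok) && --depth == 0) return i;
        }
        return -1;
    }

    // Opening tag, closing tag, and every node between them; empty if unbalanced.
    static Vector<String> getInclusive(List<String> page, int open, String tok) {
        int close = findClosing(page, open, tok);
        Vector<String> ret = new Vector<>();
        if (close >= 0) ret.addAll(page.subList(open, close + 1));
        return ret;
    }

    static List<String> samplePage() {
        return Arrays.asList(
            "<html>", "<div class='a'>", "x", "<div>", "y", "</div>", "z", "</div>", "</html>");
    }

    public static void main(String[] args) {
        System.out.println(getInclusive(samplePage(), 1, "div"));
    }
}
```

In this sketch the returned range starts with the opening tag and ends with its matching closing tag, so the inner '<div>' pair is carried along as part of the contents.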