Package Torello.HTML

Interface HTMLPage.Parser

  • Enclosing class:
    HTMLPage
    Functional Interface:
    This is a functional interface and can therefore be used as the assignment target for a lambda expression or method reference.

    public static interface HTMLPage.Parser
    A function pointer (lambda target) that could be used to replace this library's current regular-expression based parser with a faster or more efficient one.

    This Functional Interface is identical to QuintFunction<A, B, C, D, E, X> in the 'Java.Additional' package, except that its method may throw an IOException. The ability to swap parsers is not an important feature unless one has identified a way to out-perform the current parser, or wants different parsing behavior altogether. The feature remains in place because it incurs essentially zero overhead. To see the actual parser code used by this package, view the documentation for class HTMLPage, and scroll to 'View Source Files'.

    NOTE: A custom parser that has no use for the debugging log-files feature may simply ignore the three file-name parameters. The same result is available without a custom parser, by invoking one of the methods in class HTMLPage in which those log file-names are passed as null.
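
The shape described above can be sketched as follows. This is a minimal, self-contained illustration and not the library's code: 'HTMLNode' and 'Parser' are hypothetical stand-ins declared locally so the example compiles without the Torello packages, and the lambda 'SINGLE_NODE' is an invented placeholder that simply ignores the three log file-name parameters.

```java
import java.io.IOException;
import java.util.Vector;

class ParserShapeSketch
{
    // Hypothetical stand-in for the library's HTMLNode (assumption, not the real class).
    static class HTMLNode
    {
        final String str;
        HTMLNode(String str) { this.str = str; }
    }

    // Stand-in that copies the parameter list documented for HTMLPage.Parser:
    // five inputs, one result, and permission to throw IOException.
    @FunctionalInterface
    interface Parser
    {
        Vector<HTMLNode> parse(
            CharSequence html, boolean eliminateHTMLTags,
            String rawHTMLFile, String matchesFile, String justTextFile
        ) throws IOException;
    }

    // A placeholder lambda: it wraps the whole input in one node and
    // ignores the flag and all three log file-name parameters.
    static final Parser SINGLE_NODE = (html, eliminateHTMLTags, raw, matches, text) ->
    {
        Vector<HTMLNode> ret = new Vector<>();
        ret.add(new HTMLNode(html.toString()));
        return ret;
    };

    public static void main(String[] args) throws IOException
    {
        Vector<HTMLNode> v = SINGLE_NODE.parse("<p>Hello</p>", false, null, null, null);
        System.out.println(v.size() + " node(s) returned");
    }
}
```

Because the interface has a single abstract method, any lambda or method reference with a matching parameter list and return type can be assigned to it. The 'throws IOException' clause is what a plain five-argument function type would not allow.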

    • Method Summary

      @FunctionalInterface: (Lambda) Method
      Modifier and Type Method
      Vector<HTMLNode> parse(CharSequence html, boolean eliminateHTMLTags, String rawHTMLFile, String matchesFile, String justTextFile)
    • Method Detail

      • parse

        java.util.Vector<HTMLNode> parse(java.lang.CharSequence html,
                                         boolean eliminateHTMLTags,
                                         java.lang.String rawHTMLFile,
                                         java.lang.String matchesFile,
                                         java.lang.String justTextFile)
                                  throws java.io.IOException
        Parses HTML source-text into a Vector<HTMLNode>.
        Parameters:
        html - This may be any form of java.lang.CharSequence, and it will be converted into a String. It should contain the HTML that needs to be parsed and vectorized.
        eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
        rawHTMLFile - If this parameter is non-null, an identical copy of the HTML that is retrieved will be saved (as a text-file) to the file named by parameter 'rawHTMLFile'. If this parameter is null, it will be ignored (and the raw-HTML discarded).

        NOTE: If you implement your own parser and wish to ignore this parameter (writing no such file), you may skip this step.
        matchesFile - If this parameter is non-null, a parser-output file, consisting of the regular-expression matches obtained while parsing the HTML, will be saved to disk using this file-name. This is a legacy feature, which can be helpful when debugging and investigating the contents of output HTML Vectors. This parameter may be null; if it is, the regular-expression match data is simply discarded by the parser after use.

        NOTE: As above, you may skip implementing this.
        justTextFile - If this parameter is non-null, a copy of every character of text found on the downloaded web-page that is not inside an HTML TagNode or CommentNode will be saved to disk using this file-name. This is also a legacy feature. The text-file generated makes it easy to quickly scan the words that would be displayed on the page. If this parameter is null, it will be ignored.

        NOTE: As above, you may skip implementing this.
        Returns: A Vector of HTMLNodes (called 'Vectorized HTML') that represents the parsed content of the input-source.
        Throws: java.io.IOException - Thrown if there are any problems while processing the input-source HTML content (or while writing output, if any).
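
The parameter contract above can be sketched with a deliberately naive implementation. This is an illustration under stated assumptions, not the package's regular-expression parser: 'HTMLNode', 'TagNode' and 'TextNode' are hypothetical local stand-ins, tags are found by plain '<' and '>' scanning, HTML comments are not recognized, and 'matchesFile' is ignored because no regular-expression match data exists here.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Vector;

class NaiveParserSketch
{
    // Hypothetical stand-ins for the library's node classes (assumptions).
    static class HTMLNode { final String str; HTMLNode(String s) { str = s; } }
    static class TagNode  extends HTMLNode { TagNode(String s)  { super(s); } }
    static class TextNode extends HTMLNode { TextNode(String s) { super(s); } }

    // Same parameter list as the documented parse method.
    static Vector<HTMLNode> parse(
            CharSequence html, boolean eliminateHTMLTags,
            String rawHTMLFile, String matchesFile, String justTextFile
        ) throws IOException
    {
        String s = html.toString();

        // Optional log file: written only when a file-name was supplied.
        if (rawHTMLFile != null)
            Files.write(Paths.get(rawHTMLFile), s.getBytes(StandardCharsets.UTF_8));

        Vector<HTMLNode> ret      = new Vector<>();
        StringBuilder    justText = new StringBuilder();
        int              pos      = 0;

        while (pos < s.length())
        {
            int open  = s.indexOf('<', pos);
            int close = (open == -1) ? -1 : s.indexOf('>', open);

            // No complete tag remains: treat the rest of the input as text.
            if (close == -1) open = s.length();

            if (open > pos)
            {
                String text = s.substring(pos, open);
                ret.add(new TextNode(text));
                justText.append(text);
            }

            if (close == -1) break;

            // Tags are dropped from the result when eliminateHTMLTags is true.
            if (!eliminateHTMLTags) ret.add(new TagNode(s.substring(open, close + 1)));
            pos = close + 1;
        }

        // Optional log file holding only the text found outside of tags.
        if (justTextFile != null)
            Files.write(Paths.get(justTextFile), justText.toString().getBytes(StandardCharsets.UTF_8));

        // matchesFile is intentionally ignored in this sketch.
        return ret;
    }

    public static void main(String[] args) throws IOException
    {
        String page = "<p>Hi <b>there</b></p>";
        System.out.println("all nodes:  " + parse(page, false, null, null, null).size());
        System.out.println("text nodes: " + parse(page, true, null, null, null).size());
    }
}
```

Since the static method has the documented parameter list, return type and exception, a method reference such as 'NaiveParserSketch::parse' would fit an interface of this shape once the stand-in node classes are replaced by the library's own.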