Package Torello.HTML
Interface HTMLPage.Parser
- Enclosing class:
- HTMLPage
- Functional Interface:
- This is a functional interface and can therefore be used as the assignment target for a lambda expression or method reference.
@FunctionalInterface public static interface HTMLPage.Parser
A function-pointer / lambda-target that could be used to replace this library's current regular-expression based parser with something faster or more efficient.
This Functional Interface is identical to QuintFunction<A, B, C, D, E, X> in the 'Java.Additional' package, but adds the ability to throw an IOException. The ability to "swap parsers" is not a very important feature, unless one has identified a way to optimize past the abilities of the current parser, or desires something different altogether. This feature shall remain in place since it incurs essentially zero overhead. To see the actual parser code used by this package, view the documentation for class HTMLPage, and scroll to 'View Source Files'.
NOTE: If one desired, for instance, to ignore the debugging log-files feature, that is easily done by ignoring the three file-name parameters. However, the same result can be achieved in class HTMLPage by invoking one of the methods where those log file-names are passed as null.
- See Also:
HTMLPage.parser
Hi-Lited Source-Code:
- View Here: Torello/HTML/HTMLPage.java
File Size: 1,852 Bytes Line Count: 35 '\n' Characters Found
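The description above can be illustrated with a minimal sketch of a lambda that has the same shape as this interface. To keep it compilable on its own, it does not import this library: `SketchNode` and `SketchParser` are hypothetical stand-ins for HTMLNode and HTMLPage.Parser, and `ParserSwapSketch` is an illustrative name only. How such a lambda would actually be installed into HTMLPage is not shown here.

```java
import java.io.IOException;
import java.util.Vector;

// Hypothetical stand-in for this library's HTMLNode, so the sketch compiles on its own.
class SketchNode {
    final String str;
    SketchNode(String str) { this.str = str; }
}

// Hypothetical mirror of HTMLPage.Parser: five inputs, one result, may throw IOException.
@FunctionalInterface
interface SketchParser {
    Vector<SketchNode> parse(
        CharSequence html, boolean eliminateHTMLTags,
        String rawHTMLFile, String matchesFile, String justTextFile
    ) throws IOException;
}

class ParserSwapSketch {
    // A trivial "parser" that ignores the three log-file names entirely
    // and returns the whole input as a single node.
    static final SketchParser WHOLE_PAGE =
        (html, eliminateHTMLTags, rawHTMLFile, matchesFile, justTextFile) -> {
            Vector<SketchNode> ret = new Vector<>();
            ret.add(new SketchNode(html.toString()));
            return ret;
        };

    public static void main(String[] args) throws IOException {
        Vector<SketchNode> v = WHOLE_PAGE.parse("<p>Hello</p>", false, null, null, null);
        System.out.println(v.size() + " node(s): " + v.get(0).str);
    }
}
```

Because the interface has exactly one abstract method, a method reference with a matching five-parameter signature would be accepted in the same position as the lambda.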
Method Detail
parse
java.util.Vector<HTMLNode> parse(java.lang.CharSequence html, boolean eliminateHTMLTags, java.lang.String rawHTMLFile, java.lang.String matchesFile, java.lang.String justTextFile) throws java.io.IOException
Parses HTML source-text into a Vector<HTMLNode>.
- Parameters:
html
- This may be any form of java.lang.CharSequence, and it will be converted into a String. This should contain the HTML that needs to be parsed and vectorized.
eliminateHTMLTags
- When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned instead.
rawHTMLFile
- If this parameter is non-null, an identical copy of the HTML that is retrieved will be saved (as a text-file) to the file named by parameter 'rawHTMLFile'. If this parameter is null, it will be ignored (and the raw-HTML discarded).
NOTE: If you implement your own parser and do not want to output such a file, you may ignore this parameter and skip this step.
matchesFile
- If this parameter is non-null, a parser-output file, consisting of the regular-expression matches obtained while parsing the HTML, will be saved to disk using this file-name. This is a legacy feature, which can be helpful when debugging and investigating the contents of output HTML Vectors. This parameter may be null, and if it is, the Regular-Expression Match Data will simply be discarded by the parser after use.
NOTE: As above, you may skip implementing this.
justTextFile
- If this parameter is non-null, a copy of each and every character of text found on the downloaded web-page that is not inside of an HTML TagNode or CommentNode will be saved to disk using this file-name. This is also a legacy feature. The text-file generated makes it easy to quickly scan the words that would be displayed on the page. If this parameter is null, it will be ignored.
NOTE: As above, you may skip implementing this.
- Returns:
- A Vector of HTMLNode's (called 'Vectorized HTML') that represents the parsed content provided by the input-source.
- Throws:
java.io.IOException
- This exception is thrown if there are any problems while processing the input-source HTML content (or writing output, if any).
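The parameter contract above can be sketched as a deliberately naive implementation. This is not the parser used by this package: `DemoNode` is a hypothetical stand-in for HTMLNode (its `isTag` flag stands in for the TagNode / TextNode distinction), `NaiveParseSketch` is an illustrative name, and the single regular expression treats anything between '<' and '>' as a tag, comments included. The point is only to show how `eliminateHTMLTags` and the three optional file-names might be honored.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Vector;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for HTMLNode; 'isTag' stands in for the TagNode / TextNode split.
class DemoNode {
    final String str;
    final boolean isTag;
    DemoNode(String str, boolean isTag) { this.str = str; this.isTag = isTag; }
}

class NaiveParseSketch {
    // Deliberately naive: anything between '<' and '>' counts as a tag (comments included).
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static Vector<DemoNode> parse(
        CharSequence html, boolean eliminateHTMLTags,
        String rawHTMLFile, String matchesFile, String justTextFile
    ) throws IOException {
        String s = html.toString();
        Vector<DemoNode> ret = new Vector<>();
        StringBuilder matches = new StringBuilder();  // contents for 'matchesFile'
        StringBuilder text = new StringBuilder();     // contents for 'justTextFile'

        Matcher m = TAG.matcher(s);
        int pos = 0;
        while (m.find()) {
            // Text between the previous tag and this one becomes a text node.
            if (m.start() > pos) addText(ret, text, s.substring(pos, m.start()));
            // Tags are kept only when the caller has not asked to eliminate them.
            if (!eliminateHTMLTags) ret.add(new DemoNode(m.group(), true));
            matches.append(m.group()).append('\n');
            pos = m.end();
        }
        if (pos < s.length()) addText(ret, text, s.substring(pos));

        // Each log file is written only when its file-name is non-null.
        write(rawHTMLFile, s);
        write(matchesFile, matches.toString());
        write(justTextFile, text.toString());
        return ret;
    }

    private static void addText(Vector<DemoNode> ret, StringBuilder text, String t) {
        ret.add(new DemoNode(t, false));
        text.append(t);
    }

    private static void write(String fileName, String contents) throws IOException {
        if (fileName == null) return;
        Files.write(Paths.get(fileName), contents.getBytes(StandardCharsets.UTF_8));
    }
}
```

An implementation that skips the log files altogether would simply drop the three `write` calls; the null checks are what make each output optional on a per-call basis.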