java.lang.Object
- Torello.HTML.HTMLPage

public class HTMLPage
extends java.lang.Object

Generating Vectorized-HTML:
(TL;DR) To parse a String containing HTML, a single invocation is sufficient:

Java Line of Code:

Vector<HTMLNode> html = HTMLPage.getPageTokens(htmlAsString, false);

This class has many documented methods for specifying where the HTML should be retrieved from and how much of it should be parsed. Intermediate parse results may also be saved to user-specified output text-files.

The flat-file generation methods were heavily used during the development of this package, but now are largely a legacy feature.

Java HTML's flagship parser class, which converts HTML web-pages into plain Java Vectors of HTMLNode.

The purpose of this class is to parse raw HTML into vectorized pages. Each method returns a Vector<HTMLNode> holding the contents of a web-page retrieved from a website, or of an internally stored String that contains HTML/text.
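
To make the idea of a "vectorized" page concrete, the following is a minimal sketch and not the library's parser. It uses one simplistic regular expression to split an HTML string into a flat `Vector<String>` of tag and text tokens, and a hypothetical `textOnly` flag plays the role that `eliminateHTMLTags` plays in this class. The names `FlatTokenSketch` and `tokenize` are assumptions made for illustration; real HTML (comments, scripts, a `>` inside an attribute value) needs considerably more care.

```java
import java.util.Vector;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FlatTokenSketch {
    // Simplistic assumption: anything from '<' to the next '>' counts as a tag.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    private FlatTokenSketch() { }

    /** Splits 'html' into a flat list of tokens; tags are dropped when 'textOnly' is true. */
    static Vector<String> tokenize(CharSequence html, boolean textOnly) {
        Vector<String> out = new Vector<>();
        Matcher m = TAG.matcher(html);
        int pos = 0;

        while (m.find()) {
            // Text found between the previous tag and this one.
            if (m.start() > pos) out.add(html.subSequence(pos, m.start()).toString());
            if (!textOnly) out.add(m.group());
            pos = m.end();
        }

        // Trailing text after the last tag, if any.
        if (pos < html.length()) out.add(html.subSequence(pos, html.length()).toString());
        return out;
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>World</b></p>";
        System.out.println(tokenize(html, false));
        System.out.println(tokenize(html, true));
    }
}
```

The point of the sketch is only the shape of the result: a flat, ordered sequence rather than a tree, from which tag tokens can be left out entirely.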

Method Parameters

Parameter	Explanation
`URL url`	This is the url of a web-page containing text/html data. Class `HTMLPage` will connect to this url, download the byte-stream, and then parse it as an HTML page.
`CharSequence html`	If HTML has already been retrieved and stored locally, it may be passed to this class inside a `StringBuilder`, a `StringBuffer`, or an ordinary `String`. This `CharSequence` will be parsed as if the data were being retrieved from a live web-server generating HTML. When this parameter is used, no outgoing web-server connections are made; the character-sequence (most often a Java `String`) takes the place of the web-server.
`BufferedReader br`	Some web-servers expect or require a "specialized connection", for instance one using an `ISO-8859` character set, or one that explicitly requests that `UTF-8` characters be sent and retrieved. In such cases a programmer may open the connection using the `Scrape.openConn(...)` methods, or open one independently. So long as a valid Java `BufferedReader` that returns the HTML is provided, class `HTMLPage` will parse that HTML and generate a vectorized web-page of nodes.
`boolean eliminateHTMLTags`	When this is TRUE, only textual HTML data will be included in the returned `Vector<HTMLNode>`. Specifically, `TagNode` elements are never instantiated by the parser, and only `TextNode` elements holding the textual data found on the web-page are returned. The return type could as well be `Vector<TextNode>`, but Java does not allow a method to alter its return type based on an argument. NOTE: When this parameter is set to TRUE, the vectorized web-page that is returned is equivalent to parsing the same input in full and then removing every `TagNode` and `CommentNode` from the result.
`int startLineNum`	This parameter is passed to method `Scrape.getHTML(BufferedReader br, int startLineNum, int endLineNum)`, whose documentation explains how to reduce a page-download to the content found between two line-numbers (a start and an end line-number). The purpose is to make searching the generated vectorized-page a little easier; excessive header information, for example, may be useless and can be discarded immediately. NOTE: If parameter `startLineNum` is `0` or `1`, the parse will begin from the top of the web-page. EXCEPTIONS: See method `StringBuffer getHTML(...)` in class `Scrape` for more information regarding which invalid line numbers cause exceptions to be thrown.
`int endLineNum`	Same as above, but this parameter is passed to `int endLineNum` of method `Scrape.getHTML(BufferedReader br, int startLineNum, int endLineNum)`. NOTE: If parameter `endLineNum` is negative, the HTML data will be read and parsed until EOF is encountered. EXCEPTIONS: See method `StringBuffer getHTML(...)` in class `Scrape` for more information regarding which invalid line numbers cause exceptions to be thrown.
`String startTag`	Like the line-number parameters above, but this parameter is passed to `String startTag` of method `Scrape.getHTML(BufferedReader br, String startTag, String endTag)`. EXCEPTIONS: See method `StringBuffer getHTML(...)` in class `Scrape` for more information regarding what causes exceptions to be thrown, such as a `startTag` that is not found in the input.
`String endTag`	Same as above, but this parameter is passed to `String endTag` of method `Scrape.getHTML(BufferedReader br, String startTag, String endTag)`. EXCEPTIONS: See method `StringBuffer getHTML(...)` in class `Scrape` for more information regarding what causes exceptions to be thrown, such as an `endTag` that is not found in the input.
`String rawHTMLFile`	When this parameter is included in the method-signature parameter list, all HTML retrieved from the web-server will be copied directly to a flat-file on the file-system named by this String. NOTE: For this parameter and the two file-name parameters below, if `null` is passed as the file-name, that set of data will not be collected and no file will be saved. This can be useful, for example, when only the regex data needs to be reviewed, but not the raw-HTML page-data.
`String matchesFile`	When this parameter is included, all regular-expression matcher information generated by the parser will be written to a flat-file on the file-system with this name. This data may be used for debugging, but is generally not very useful except for understanding the regex. It is kept in these methods for legacy purposes; the earliest debugging of these scrape-package classes used these flat-files quite frequently for testing.
`String justTextFile`	When this parameter is included in the method-signature parameter list, all `TextNode` elements generated by the parser will be written to a flat-file with the name given in this String. This data may be used for quickly scanning the content of a web-page, but is generally not very useful. It is kept here for legacy purposes; the earliest debugging of these scrape-package classes used these flat-files quite frequently for testing.
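
The two page-limiting parameter pairs in the table above can be pictured with a short sketch. It is written against the standard library only and is not the implementation of `Scrape.getHTML(...)`: the class `PageLimitSketch` and both method names are hypothetical, line numbers are assumed to be 1-based and inclusive, and a missing tag is reported with an `IllegalArgumentException` purely for the sake of the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

final class PageLimitSketch {
    private PageLimitSketch() { }

    /** Keeps lines startLine..endLine (assumed 1-based, inclusive); a negative endLine reads to EOF. */
    static String betweenLines(BufferedReader br, int startLine, int endLine) throws IOException {
        StringBuilder sb = new StringBuilder();
        String line;
        int n = 0;

        while ((line = br.readLine()) != null) {
            n++;
            if (endLine >= 0 && n > endLine) break;            // past the window: stop reading
            if (n >= startLine) sb.append(line).append('\n');  // inside the window: keep the line
        }
        return sb.toString();
    }

    /** Keeps the text from the first 'startTag' through the end of the next 'endTag'. */
    static String betweenTags(String html, String startTag, String endTag) {
        int s = html.indexOf(startTag);
        if (s == -1) throw new IllegalArgumentException("startTag not found: " + startTag);

        int e = html.indexOf(endTag, s);
        if (e == -1) throw new IllegalArgumentException("endTag not found: " + endTag);

        return html.substring(s, e + endTag.length());
    }

    public static void main(String[] args) throws IOException {
        String page = "<html>\n<head></head>\n<body>\n<p>Hi</p>\n</body>\n</html>";
        System.out.print(betweenLines(new BufferedReader(new StringReader(page)), 3, 5));
        System.out.println(betweenTags(page, "<p>", "</p>"));
    }
}
```

In both cases the input is trimmed before any parsing takes place, so content outside the window never reaches the parser.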

Return Values:

All methods return a Vector<HTMLNode>, which represents a vectorized-HTML page whose elements are the parsed content of the web-page that served as input to the selected getPageTokens(...) method.

See Also: Scrape.getHTML(BufferedReader, int, int), Scrape.getHTML(BufferedReader, String, String), HTMLPageMWT

Hi-Lited Source-Code:

This File's Source Code: Torello/HTML/HTMLPage.java
File Size: 21,366 Bytes, Line Count: 483 '\n' Characters Found

HTML Regular-Expression Parser Class: Torello/HTML/HelperPackages/parse/ParserRE.java
File Size: 2,596 Bytes, Line Count: 71 '\n' Characters Found

HTML Parser, Inner-Loop Class: Torello/HTML/HelperPackages/parse/ParserREInternal.java
File Size: 9,690 Bytes, Line Count: 212 '\n' Characters Found

HTML Regular-Expressions Class: Torello/HTML/HelperPackages/parse/HTMLRegEx.java
File Size: 1,851 Bytes, Line Count: 27 '\n' Characters Found

Stateless Class:

This class neither contains any program-state, nor can it be instantiated. The @StaticFunctional Annotation may also be called 'The Spaghetti Report'. Static-Functional classes are, essentially, C-Styled Files, without any constructors or non-static member fields. It is a concept very similar to the Java-Bean's @Stateless Annotation.

1 Constructor(s), 1 declared private, zero-argument constructor
18 Method(s), 18 declared static
1 Field(s), 1 declared static, 0 declared final
Fields excused from final modifier (with explanation):

Field 'parser' is not final. Reason: SINGLETON
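
As a rough illustration of the static-functional shape described above (one private zero-argument constructor, only static methods, no instance state), a minimal sketch might look like the following. `TextUtilSketch` and `countLines` are hypothetical names, and the sketch does not use the `@StaticFunctional` annotation itself.

```java
final class TextUtilSketch {
    // Private zero-argument constructor: the class cannot be instantiated from outside.
    private TextUtilSketch() { }

    // Behaviour is exposed only through static methods that depend solely on their arguments.
    static int countLines(CharSequence text) {
        if (text.length() == 0) return 0;

        int lines = 1;
        for (int i = 0; i < text.length(); i++)
            if (text.charAt(i) == '\n') lines++;
        return lines;
    }
}
```

Callers write `TextUtilSketch.countLines(...)` directly; there is no object to create, configure, or share between threads.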

Nested Class Summary

Nested Classes
Modifier and Type Class

static interface HTMLPage.Parser

Field Summary

Fields
Modifier and Type Field

static HTMLPage.Parser parser
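
Because the `parser` field is static, non-final, and typed as a nested interface, a replacement can be assigned at run-time. The sketch below shows that pattern in isolation. Everything in it is hypothetical: `SwappableParserSketch`, its two-argument `Parser` interface (the real `HTMLPage.Parser.parse` takes five parameters), and the toy "parsers" that merely label their input.

```java
import java.util.Vector;

final class SwappableParserSketch {
    /** Simplified stand-in for a parser interface; the real one takes more parameters. */
    @FunctionalInterface
    interface Parser {
        Vector<String> parse(CharSequence html, boolean textOnly);
    }

    // Static and deliberately non-final, so that callers may assign a replacement.
    static Parser parser = SwappableParserSketch::defaultParse;

    private SwappableParserSketch() { }

    static Vector<String> defaultParse(CharSequence html, boolean textOnly) {
        Vector<String> v = new Vector<>();
        v.add("default:" + html);
        return v;
    }

    // Entry points delegate to whatever 'parser' currently references.
    static Vector<String> getTokens(CharSequence html) {
        return parser.parse(html, false);
    }

    public static void main(String[] args) {
        System.out.println(getTokens("<p>x</p>"));

        // Swap in an alternative implementation, here written as a lambda.
        parser = (html, textOnly) -> {
            Vector<String> v = new Vector<>();
            v.add("custom:" + html);
            return v;
        };
        System.out.println(getTokens("<p>x</p>"));
    }
}
```

One consequence of this design is that the swap is global: every caller of the static entry points sees the new parser from the moment it is assigned.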

Method Summary

Standard Parse

Modifier and Type	Method
`static Vector<HTMLNode>`	`getPageTokens(BufferedReader br, boolean eliminateHTMLTags)`
`static Vector<HTMLNode>`	`getPageTokens(CharSequence html, boolean eliminateHTMLTags)`
`static Vector<HTMLNode>`	`getPageTokens(URL url, boolean eliminateHTMLTags)`

Standard Parse, w/ Debug Files
Modifier and Type	Method
`static Vector<HTMLNode>`	`getPageTokens(BufferedReader br, boolean eliminateHTMLTags, String rawHTMLFile, String matchesFile, String justTextFile)`
`static Vector<HTMLNode>`	`getPageTokens(CharSequence html, boolean eliminateHTMLTags, String rawHTMLFile, String matchesFile, String justTextFile)`
`static Vector<HTMLNode>`	`getPageTokens(URL url, boolean eliminateHTMLTags, String rawHTMLFile, String matchesFile, String justTextFile)`

Page Limited: Line-Number
Modifier and Type	Method
`static Vector<HTMLNode>`	`getPageTokens(BufferedReader br, boolean eliminateHTMLTags, int startLineNum, int endLineNum)`
`static Vector<HTMLNode>`	`getPageTokens(CharSequence html, boolean eliminateHTMLTags, int startLineNum, int endLineNum)`
`static Vector<HTMLNode>`	`getPageTokens(URL url, boolean eliminateHTMLTags, int startLineNum, int endLineNum)`

Page Limited: Line-Number, w/ Debug Files
Modifier and Type	Method
`static Vector<HTMLNode>`	`getPageTokens(BufferedReader br, boolean eliminateHTMLTags, int startLineNum, int endLineNum, String rawHTMLFile, String matchesFile, String justTextFile)`
`static Vector<HTMLNode>`	`getPageTokens(CharSequence html, boolean eliminateHTMLTags, int startLineNum, int endLineNum, String rawHTMLFile, String matchesFile, String justTextFile)`
`static Vector<HTMLNode>`	`getPageTokens(URL url, boolean eliminateHTMLTags, int startLineNum, int endLineNum, String rawHTMLFile, String matchesFile, String justTextFile)`

Page Limited: Start & End Substring
Modifier and Type	Method
`static Vector<HTMLNode>`	`getPageTokens(BufferedReader br, boolean eliminateHTMLTags, String startTag, String endTag)`
`static Vector<HTMLNode>`	`getPageTokens(CharSequence html, boolean eliminateHTMLTags, String startTag, String endTag)`
`static Vector<HTMLNode>`	`getPageTokens(URL url, boolean eliminateHTMLTags, String startTag, String endTag)`

Page Limited: Start & End Substring, w/ Debug Files
Modifier and Type	Method
`static Vector<HTMLNode>`	`getPageTokens(BufferedReader br, boolean eliminateHTMLTags, String startTag, String endTag, String rawHTMLFile, String matchesFile, String justTextFile)`
`static Vector<HTMLNode>`	`getPageTokens(CharSequence html, boolean eliminateHTMLTags, String startTag, String endTag, String rawHTMLFile, String matchesFile, String justTextFile)`
`static Vector<HTMLNode>`	`getPageTokens(URL url, boolean eliminateHTMLTags, String startTag, String endTag, String rawHTMLFile, String matchesFile, String justTextFile)`

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
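
The overloads in the Method Summary differ only in which optional arguments they expose; the shorter forms forward to a longer form and pass `null` for whatever they omit. A minimal sketch of that delegation pattern follows, using hypothetical names (`OverloadSketch`, `describe`) and a toy return value in place of real parsing.

```java
final class OverloadSketch {
    private OverloadSketch() { }

    // Shortest form: no page limits, no debug file.
    static String describe(CharSequence html) {
        return describe(html, null, null, null);
    }

    // Page-limited form: still no debug file.
    static String describe(CharSequence html, String startTag, String endTag) {
        return describe(html, startTag, endTag, null);
    }

    // Full form: the only method with real logic; null means "feature not requested".
    static String describe(CharSequence html, String startTag, String endTag, String rawFile) {
        return "chars=" + html.length()
            + " limited=" + (startTag != null && endTag != null)
            + " dump=" + (rawFile != null);
    }
}
```

Keeping the logic in one place means a behavioural change only has to be made in the full form, while the convenience forms stay one-liners.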

- Field Detail
  - parser
    
    public static HTMLPage.Parser parser
    
    If the need to swap in a different parser arises, this is possible. The replacement only needs to accept the same parameters as the current parser and produce a Vector<HTMLNode>. This is not an advised step, but if an alternative parser has been tested and happens to generate different results, it can easily be swapped in for the one used now.
    
    See Also:
    
    HTMLPage.Parser, HTMLPage.Parser.parse(java.lang.CharSequence, boolean, java.lang.String, java.lang.String, java.lang.String)
    
    Code:
    
    Exact Field Declaration Expression:
    
    public static Parser parser = Torello.HTML.HelperPackages.parse.ParserRE::parsePageTokens;
- Method Detail
  - getPageTokens
    
    public static java.util.Vector<HTMLNode> getPageTokens (java.net.URL url, boolean eliminateHTMLTags) throws java.io.IOException
    
    Convenience Method
    Accepts: URL
    Passes null to parameters startTag, endTag, rawHTMLFile, matchesFile & justTextFile.
    Invokes: getPageTokens(BufferedReader, boolean, String, String, String, String, String)
    And Invokes: Scrape.openConn(URL)
    
    Code:
    
    Exact Method Body:
    
    return getPageTokens (Scrape.openConn(url), eliminateHTMLTags, null, null, null, null, null);
  - getPageTokens
    
    public static java.util.Vector<HTMLNode> getPageTokens (java.net.URL url, boolean eliminateHTMLTags, java.lang.String startTag, java.lang.String endTag) throws java.io.IOException
    
    Convenience Method
    Accepts: URL
    And-Accepts: 'startTag' and 'endTag'
    Passes null to parameters rawHTMLFile, matchesFile & justTextFile.
    Invokes: getPageTokens(BufferedReader, boolean, String, String, String, String, String)
    And Invokes: Scrape.openConn(URL)
    
    Code:
    
    Exact Method Body:
    
    return getPageTokens (Scrape.openConn(url), eliminateHTMLTags, startTag, endTag, null, null, null);
  - getPageTokens
    
    public static java.util.Vector<HTMLNode> getPageTokens (java.net.URL url, boolean eliminateHTMLTags, int startLineNum, int endLineNum) throws java.io.IOException
    
    Convenience Method
    Accepts: URL
    And-Accepts: 'startLineNum' and 'endLineNum'
    Passes null to parameters rawHTMLFile, matchesFile & justTextFile.
    Invokes: getPageTokens(BufferedReader, boolean, int, int, String, String, String)
    And Invokes: Scrape.openConn(URL)
    
    Code:
    
    Exact Method Body:
    
    return getPageTokens (Scrape.openConn(url), eliminateHTMLTags, startLineNum, endLineNum, null, null, null);
  - getPageTokens
    
    public static java.util.Vector<HTMLNode> getPageTokens (java.net.URL url, boolean eliminateHTMLTags, java.lang.String rawHTMLFile, java.lang.String matchesFile, java.lang.String justTextFile) throws java.io.IOException
    
    Convenience Method
    Accepts: URL
    Passes null to startTag & endTag parameters.
    Invokes: getPageTokens(BufferedReader, boolean, String, String, String, String, String)
    And Invokes: Scrape.openConn(URL)
    
    Code:
    
    Exact Method Body:
    
    return getPageTokens( Scrape.openConn(url), eliminateHTMLTags, null, null, rawHTMLFile, matchesFile, justTextFile );
  - getPageTokens
    
    public static java.util.Vector<HTMLNode> getPageTokens (java.net.URL url, boolean eliminateHTMLTags, java.lang.String startTag, java.lang.String endTag, java.lang.String rawHTMLFile, java.lang.String matchesFile, java.lang.String justTextFile) throws java.io.IOException
    
    Convenience Method
    Accepts: URL
    And-Accepts: 'startTag' and 'endTag'
    Invokes: getPageTokens(BufferedReader, boolean, String, String, String, String, String)
    And Invokes: Scrape.openConn(URL)
    
    Code:
    
    Exact Method Body:
    
    return getPageTokens( Scrape.openConn(url), eliminateHTMLTags, startTag, endTag, rawHTMLFile, matchesFile, justTextFile );
  - getPageTokens
    
    public static java.util.Vector<HTMLNode> getPageTokens (java.net.URL url, boolean eliminateHTMLTags, int startLineNum, int endLineNum, java.lang.String rawHTMLFile, java.lang.String matchesFile, java.lang.String justTextFile) throws java.io.IOException
    
    Convenience Method
    Accepts: URL
    And-Accepts: 'startLineNum' and 'endLineNum'
    Invokes: getPageTokens(BufferedReader, boolean, int, int, String, String, String)
    And Invokes: Scrape.openConn(URL)
    
    Code:
    
    Exact Method Body:
    
    return getPageTokens( Scrape.openConn(url), eliminateHTMLTags, startLineNum, endLineNum, rawHTMLFile, matchesFile, justTextFile );
  - getPageTokens
    
    public static java.util.Vector<HTMLNode> getPageTokens (java.lang.CharSequence html, boolean eliminateHTMLTags)
    
    Parses and Vectorizes HTML from a CharSequence (usually a String) source.
    
    Parameters:
    
    html - This may be any form of java.lang.CharSequence, and it will be converted into a String. This should contain HTML that needs to be parsed, and vectorized.
    
    eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
    
    Returns:
    
    A Vector of HTMLNode's (called 'Vectorized HTML') that represents the available parsed-content provided by the input-source.
    
    NOTE: This method does not throw any checked exceptions. No input/output is involved; it is strictly a computational method that touches neither the file-system nor the web.
    
    Code:
    
    Exact Method Body:
    
    try { return parser.parse(html, eliminateHTMLTags, null, null, null); }
    // This should never happen: when reading from a 'String' rather than a URL or
    // BufferedReader, an IOException will not be thrown.
    catch (IOException ioe) { throw new UnreachableError(ioe); }
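
    The pattern in the method body above, catching a checked exception that cannot occur for an in-memory source and rethrowing it as an error, can be sketched with the standard library alone. In this sketch `AssertionError` stands in for the library's `UnreachableError`, and `UnreachableSketch` and `countChars` are hypothetical names.

    ```java
    import java.io.IOException;
    import java.io.StringReader;

    final class UnreachableSketch {
        private UnreachableSketch() { }

        // StringReader.read() declares IOException, but for an open, in-memory
        // reader it is not expected to be thrown.
        static int countChars(String s) {
            try (StringReader r = new StringReader(s)) {
                int n = 0;
                while (r.read() != -1) n++;
                return n;
            } catch (IOException e) {
                // AssertionError is an assumption here, used in place of UnreachableError.
                throw new AssertionError("unreachable: in-memory read failed", e);
            }
        }
    }
    ```

    The benefit for callers is a method signature without `throws IOException`, while an unexpected failure still surfaces loudly instead of being swallowed.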
  - getPageTokens
    
    public static java.util.Vector<HTMLNode> getPageTokens (java.lang.CharSequence html, boolean eliminateHTMLTags, java.lang.String startTag, java.lang.String endTag)
    
    Convenience Method
    Accepts: CharSequence
    And-Accepts: 'startTag' and 'endTag'
    Passes null to parameters rawHTMLFile, matchesFile & justTextFile.
    Invokes: getPageTokens(CharSequence, boolean, String, String, String, String, String)
    Catches: IOException ==> No HTTP-I/O, so an IOException isn't possible!
  - getPageTokens
    
    public static java.util.Vector<HTMLNode> getPageTokens (java.lang.CharSequence html, boolean eliminateHTMLTags, int startLineNum, int endLineNum)
    
    Convenience Method
    Accepts: CharSequence
    And-Accepts: 'startLineNum' and 'endLineNum'
    Passes null to parameters rawHTMLFile, matchesFile & justTextFile.
    Invokes: getPageTokens(CharSequence, boolean, int, int, String, String, String)
    Catches: IOException ==> No HTTP-I/O, so an IOException isn't possible!
  - getPageTokens
    
    public static java.util.Vector<HTMLNode> getPageTokens (java.lang.CharSequence html, boolean eliminateHTMLTags, java.lang.String rawHTMLFile, java.lang.String matchesFile, java.lang.String justTextFile) throws java.io.IOException
    
    Parses and Vectorizes HTML from a CharSequence (usually a String) source.
    
    Parameters:
    
    html - This may be any form of java.lang.CharSequence, and it will be converted into a String. This should contain HTML that needs to be parsed, and vectorized.
    
    eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
    
    rawHTMLFile - If this parameter is non-null, an identical copy of the HTML that is retrieved will be saved (as a text-file) to the file named by parameter 'rawHTMLFile'. If this parameter is null, it will be ignored (and the raw-HTML discarded).
    
    matchesFile - If this parameter is non-null, a parser-output file, consisting of the regular-expression matches obtained while parsing the HTML, will be saved to disk using this file-name. This is a legacy feature, which can be helpful when debugging and investigating the contents of output HTML-Vector's. This parameter may be null, and if it is, Regular-Expression Match Data will simply be discarded by the parser, after use.
    
    justTextFile - If this parameter is non-null, a copy of each and every character of text found on the downloaded web-page - that is not inside of an HTML TagNode or CommentNode - will be saved to disk using this file-name. This is also a legacy feature. The text-file generated makes it easy to quickly scan the words that would be displayed on the page. If this parameter is null, it will be ignored.
    
    Returns:
    
    A Vector of HTMLNode's (called 'Vectorized HTML') that represents the available parsed-content provided by the input-source.
    
    Throws:
    
    java.io.IOException - Thrown if there are any problems while processing the input-source HTML content (or writing output, if any).
    
    Code:
    
    Exact Method Body:
    
    return parser.parse(html, eliminateHTMLTags, rawHTMLFile, matchesFile, justTextFile);
  - getPageTokens
    
    public static java.util.Vector<HTMLNode> getPageTokens (java.lang.CharSequence html, boolean eliminateHTMLTags, java.lang.String startTag, java.lang.String endTag, java.lang.String rawHTMLFile, java.lang.String matchesFile, java.lang.String justTextFile) throws java.io.IOException
    
    Parses and Vectorizes HTML from a CharSequence (usually a String) source.
    
    Parameters:
    
    html - This may be any form of java.lang.CharSequence, and it will be converted into a String. This should contain HTML that needs to be parsed, and vectorized.
    
    eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
    
    startTag - If this parameter is non-null, the scrape-logic will skip all content before finding the substring 'startTag'. Parsing HTML will not begin until this token is identified somewhere in the input-source.
    
    endTag - If this parameter is non-null, the scrape-logic will skip all content after the substring 'endTag' is identified in the input-source.
    
    rawHTMLFile - If this parameter is non-null, an identical copy of the HTML that is retrieved will be saved (as a text-file) to the file named by parameter 'rawHTMLFile'. If this parameter is null, it will be ignored (and the raw-HTML discarded).
    
    matchesFile - If this parameter is non-null, a parser-output file, consisting of the regular-expression matches obtained while parsing the HTML, will be saved to disk using this file-name. This is a legacy feature, which can be helpful when debugging and investigating the contents of output HTML-Vector's. This parameter may be null, and if it is, Regular-Expression Match Data will simply be discarded by the parser, after use.
    
    justTextFile - If this parameter is non-null, a copy of each and every character of text found on the downloaded web-page - that is not inside of an HTML TagNode or CommentNode - will be saved to disk using this file-name. This is also a legacy feature. The text-file generated makes it easy to quickly scan the words that would be displayed on the page. If this parameter is null, it will be ignored.
    
    Returns:
    
    A Vector of HTMLNode's (called 'Vectorized HTML') that represents the available parsed-content provided by the input-source.
    
    Throws:
    
    java.io.IOException - Thrown if there are any problems while processing the input-source HTML content (or writing output, if any).
    
    java.lang.IllegalArgumentException - If either startTag or endTag is not found in the input HTML.
    
    Code:
    
    Exact Method Body:
    
    String htmlStr = html.toString();
    int sPos = htmlStr.indexOf(startTag);
    if (sPos == -1) throw new IllegalArgumentException
        ("Passed String-Parameter 'startTag' [" + startTag + "] was not found in HTML.");
    int ePos = htmlStr.indexOf(endTag, sPos);
    if (ePos == -1) throw new IllegalArgumentException
        ("Passed String-Parameter 'endTag' [" + endTag + "] was not found in HTML.");
    ePos += endTag.length();
    return parser.parse(
        htmlStr.substring(sPos, ePos), eliminateHTMLTags, rawHTMLFile, matchesFile, justTextFile
    );
  - getPageTokens
    
    public static java.util.Vector<HTMLNode> getPageTokens (java.lang.CharSequence html, boolean eliminateHTMLTags, int startLineNum, int endLineNum, java.lang.String rawHTMLFile, java.lang.String matchesFile, java.lang.String justTextFile) throws java.io.IOException
    
    Convenience Method
    Accepts: CharSequence
    And-Accepts: 'startLineNum' and 'endLineNum'
    Invokes: getPageTokens(BufferedReader, boolean, int, int, String, String, String)
    
    Code:
    
    Exact Method Body:
    
    return getPageTokens( new BufferedReader(new StringReader(html.toString())), eliminateHTMLTags, startLineNum, endLineNum, rawHTMLFile, matchesFile, justTextFile );
  - getPageTokens
    
    public static java.util.Vector<HTMLNode> getPageTokens (java.io.BufferedReader br, boolean eliminateHTMLTags) throws java.io.IOException
    
    Convenience Method
    Accepts: BufferedReader
    Passes null to parameters startTag, endTag, rawHTMLFile, matchesFile & justTextFile.
    Invokes: getPageTokens(BufferedReader, boolean, String, String, String, String, String)
    
    Code:
    
    Exact Method Body:
    
    return getPageTokens(br, eliminateHTMLTags, null, null, null, null, null);
  - getPageTokens
    
    public static java.util.Vector<HTMLNode> getPageTokens (java.io.BufferedReader br, boolean eliminateHTMLTags, java.lang.String startTag, java.lang.String endTag) throws java.io.IOException
    
    Convenience Method
    Accepts: BufferedReader
    And-Accepts: 'startTag' and 'endTag'
    Passes null to parameters rawHTMLFile, matchesFile & justTextFile.
    Invokes: getPageTokens(BufferedReader, boolean, String, String, String, String, String)
    
    Code:
    
    Exact Method Body:
    
    return getPageTokens(br, eliminateHTMLTags, startTag, endTag, null, null, null);
  - getPageTokens
    
    public static java.util.Vector<HTMLNode> getPageTokens (java.io.BufferedReader br, boolean eliminateHTMLTags, int startLineNum, int endLineNum) throws java.io.IOException
    
    Convenience Method
    Accepts: BufferedReader
    And-Accepts: 'startLineNum' and 'endLineNum'
    Passes null to parameters rawHTMLFile, matchesFile & justTextFile.
    Invokes: getPageTokens(BufferedReader, boolean, int, int, String, String, String)
    
    Code:
    
    Exact Method Body:
    
    return getPageTokens(br, eliminateHTMLTags, startLineNum, endLineNum, null, null, null);
  - getPageTokens
    
    public static java.util.Vector<HTMLNode> getPageTokens (java.io.BufferedReader br, boolean eliminateHTMLTags, java.lang.String rawHTMLFile, java.lang.String matchesFile, java.lang.String justTextFile) throws java.io.IOException
    
    Convenience Method
    Accepts: BufferedReader
    Passes null to startTag & endTag parameters.
    Invokes: getPageTokens(BufferedReader, boolean, String, String, String, String, String)
    
    Code:
    
    Exact Method Body:
    
    return getPageTokens (br, eliminateHTMLTags, null, null, rawHTMLFile, matchesFile, justTextFile);
  - getPageTokens
    
    public static java.util.Vector<HTMLNode> getPageTokens (java.io.BufferedReader br, boolean eliminateHTMLTags, java.lang.String startTag, java.lang.String endTag, java.lang.String rawHTMLFile, java.lang.String matchesFile, java.lang.String justTextFile) throws java.io.IOException
    
    Parses and Vectorizes HTML from a BufferedReader source.
    
    Parameters:
    
    br - This BufferedReader will be scanned, and the HTML saved to a String. Then it is parsed into HTMLNode's and returned as an HTML Vector.
    
    eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
    
    startTag - If this parameter is non-null, the scrape-logic will skip all content before finding the substring 'startTag'. Parsing HTML will not begin until this token is identified somewhere in the input-source.
    
    endTag - If this parameter is non-null, the scrape-logic will skip all content after the substring 'endTag' is identified in the input-source.
    
    rawHTMLFile - If this parameter is non-null, an identical copy of the HTML that is retrieved will be saved (as a text-file) to the file named by parameter 'rawHTMLFile'. If this parameter is null, it will be ignored (and the raw-HTML discarded).
    
    matchesFile - If this parameter is non-null, a parser-output file, consisting of the regular-expression matches obtained while parsing the HTML, will be saved to disk using this file-name. This is a legacy feature, which can be helpful when debugging and investigating the contents of output HTML-Vector's. This parameter may be null, and if it is, Regular-Expression Match Data will simply be discarded by the parser, after use.
    
    justTextFile - If this parameter is non-null, a copy of each and every character of text found on the downloaded web-page - that is not inside of an HTML TagNode or CommentNode - will be saved to disk using this file-name. This is also a legacy feature. The text-file generated makes it easy to quickly scan the words that would be displayed on the page. If this parameter is null, it will be ignored.
    
    Returns:
    
    A Vector of HTMLNode's (called 'Vectorized HTML') that represents the available parsed-content provided by the input-source.
    
    Throws:
    
    java.io.IOException - Thrown if there are any problems while processing the input-source HTML content (or writing output, if any).
    
    ScrapeException - If either startTag or endTag are non-null, but also not-found on the input-page.
    
    Code:
    
    Exact Method Body:
    
    return parser.parse( Scrape.getHTML(br, startTag, endTag), eliminateHTMLTags, rawHTMLFile, matchesFile, justTextFile );
  - getPageTokens
    
    public static java.util.Vector<HTMLNode> getPageTokens (java.io.BufferedReader br, boolean eliminateHTMLTags, int startLineNum, int endLineNum, java.lang.String rawHTMLFile, java.lang.String matchesFile, java.lang.String justTextFile) throws java.io.IOException
    
    Parses and Vectorizes HTML from a BufferedReader source.
    
    Parameters:
    
    br - This BufferedReader will be scanned, and the HTML saved to a String. Then it is parsed into HTMLNode's and returned as an HTML Vector.
    
    eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
    
    startLineNum - This parameter allows a programmer to prevent any content on the web-page (or sub-page) whose line-number is before 'startLineNum' from being retrieved (or parsed) into the return-Vector.
    
    endLineNum - This parameter allows a programmer to prevent any content on the web-page (or sub-page) whose line-number is after 'endLineNum' from being retrieved (or parsed) into the return-Vector.
    
    rawHTMLFile - If this parameter is non-null, an identical copy of the HTML that is retrieved will be saved (as a text-file) to the file named by parameter 'rawHTMLFile'. If this parameter is null, it will be ignored (and the raw-HTML discarded).
    
    matchesFile - If this parameter is non-null, a parser-output file, consisting of the regular-expression matches obtained while parsing the HTML, will be saved to disk using this file-name. This is a legacy feature, which can be helpful when debugging and investigating the contents of output HTML-Vector's. This parameter may be null, and if it is, Regular-Expression Match Data will simply be discarded by the parser, after use.
    
    justTextFile - If this parameter is non-null, a copy of each and every character of text found on the downloaded web-page - that is not inside of an HTML TagNode or CommentNode - will be saved to disk using this file-name. This is also a legacy feature. The text-file generated makes it easy to quickly scan the words that would be displayed on the page. If this parameter is null, it will be ignored.
    
    Returns:
    
    A Vector of HTMLNode's (called 'Vectorized HTML') that represents the available parsed-content provided by the input-source.
    
    Throws:
    
    java.io.IOException - Thrown if there are any problems while processing the input-source HTML content (or writing output, if any).
    
    java.lang.IllegalArgumentException - If parameter startLineNum is negative, or endLineNum is before startLineNum.
    
    ScrapeException - If either startLineNum or endLineNum is greater than the number of lines on the web-page (or sub-page).
    
    Code:
    
    Exact Method Body:
    
    return parser.parse( Scrape.getHTML(br, startLineNum, endLineNum), eliminateHTMLTags, rawHTMLFile, matchesFile, justTextFile );
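
Since the `BufferedReader` variants accept any reader, the character encoding can be fixed by the caller before parsing begins. The sketch below uses only the standard library, with an in-memory byte array in place of a network stream. `CharsetReaderSketch` and `open` are hypothetical names, and `Scrape.openConn(...)` is neither used nor modelled here.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetReaderSketch {
    private CharsetReaderSketch() { }

    /** Wraps a raw byte stream in a reader that decodes it with the given charset. */
    static BufferedReader open(InputStream in, Charset cs) {
        return new BufferedReader(new InputStreamReader(in, cs));
    }

    public static void main(String[] args) throws IOException {
        // Bytes encoded as ISO-8859-1, as a server using that charset might send them.
        byte[] latin1 = "<p>caf\u00e9</p>".getBytes(StandardCharsets.ISO_8859_1);

        try (BufferedReader br = open(new ByteArrayInputStream(latin1), StandardCharsets.ISO_8859_1)) {
            System.out.println(br.readLine());
        }
    }
}
```

A reader built this way could then be handed to any method that takes a `BufferedReader`; decoding the same bytes with the wrong charset would silently corrupt non-ASCII characters before the parser ever sees them.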
