Package Torello.HTML
Class HTMLPageMWT
java.lang.Object
    Torello.HTML.HTMLPageMWT
public class HTMLPageMWT extends java.lang.Object
Generating Vectorized-HTML (with Timeout):
(TL;DR) To download and parse HTML from a URL (and ensure that server-hangs are handled appropriately), the following simple invocation will do the trick:
Java Line of Code:
Vector<HTMLNode> html = HTMLPageMWT.getPageTokens(timeout, unit, url, false);
This class has many documented methods for specifying where the HTML should be retrieved from and how much of it should be parsed. There is also a feature for saving intermediate parse results to user-specified output text-files.
The flat-file generation methods were heavily used during the development of this package, but are now largely a legacy feature.
A carbon-copy of class HTMLPage, augmented with a mechanism for setting a timeout, so that when scraping web-pages and URL's from servers that have a tendency to hang, freeze, or delay, the Java Virtual Machine can skip them and move on once that timeout expires.
MWT: Maximum Wait Time
This class uses a "Cached Thread Pool" to spawn a thread that watches the downloading of HTML sequences from a web server. The user must provide a timeout and a TimeUnit to specify the maximum amount of time that this class should wait when querying a web-server.
Generally, the most commonly used web-servers on the internet will respond very quickly, or they will reply with one of the usual errors: HTTP 404, HTTP 503, HTTP 400, etc. A web-server actually locking, hanging and freezing your program's download-progress (in essence, freezing your whole program) is uncommon. However, there are a few web-URL's for which the download neither throws the typical IOException or FileNotFoundException, nor returns an empty-page message. On the occasions when a site just hangs, setting a maximum wait time prevents the program from hanging along with it.
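The maximum-wait-time idea described above can be sketched with the JDK alone. This is a minimal illustration, not the code of this class: the class name MaxWaitTimeSketch and the helper callWithMaxWait are hypothetical, and cancelling the worker on timeout is an assumption of the sketch.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class MaxWaitTimeSketch
{
    // Hypothetical helper (not part of Torello.HTML): runs 'task' on a pool thread,
    // and returns null when no result arrives within the maximum wait time.
    static <T> T callWithMaxWait
        (ExecutorService pool, Callable<T> task, long timeout, TimeUnit unit)
    {
        Future<T> future = pool.submit(task);

        try
            { return future.get(timeout, unit); }

        catch (TimeoutException e)
        {
            // Assumption of this sketch: interrupt the worker once the wait is over.
            future.cancel(true);
            return null;
        }

        catch (InterruptedException e)
        {
            // Restore the interrupt flag, and report the problem unchecked.
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }

        catch (ExecutionException e)
            { throw new RuntimeException(e.getCause()); }
    }

    public static void main(String[] args)
    {
        ExecutorService pool = Executors.newCachedThreadPool();

        try
        {
            // A task that answers quickly, and one that "hangs" past its wait time.
            String fast = callWithMaxWait(pool, () -> "<html></html>", 2, TimeUnit.SECONDS);
            String slow = callWithMaxWait
                (pool, () -> { Thread.sleep(5_000); return "late"; }, 100, TimeUnit.MILLISECONDS);

            System.out.println("fast = " + fast);
            System.out.println("slow = " + slow);
        }

        finally { pool.shutdownNow(); }
    }
}
```

A caller treats a null result as "the server did not answer in time" and moves on to the next page.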
Thread Safe:
This class uses a cached thread pool. There is an 'executor' global variable that is safely guarded using a lock. This service spawns a monitor Thread from a Thread Pool whose only purpose is to make sure the requested "maximum" wait-time is not exceeded.
As is indicated by the 'Stateless Class' Report (several lines below), there are only three global-static fields in this class. Furthermore, its only constructor is private and takes no arguments, so the class cannot be instantiated. Together this means the class holds no per-instance program state, which makes it Thread-Safe.
Exceptions:
In the event that an exception has been thrown during the web-server polling, this class may throw an 'InterruptedException'. This can only occur if the monitor-Thread was stopped. The methods in this class might also throw any of the exceptions that generally occur when polling a web-server, including the checked exceptions FileNotFoundException and IOException, as well as the usual unchecked RuntimeException's.
If, by any chance, a java.util.concurrent.RejectedExecutionException is thrown, make sure to check the value of e.getCause() to see what has occurred. Note that this is also an unchecked exception (therefore catching this exception is not mandatory).
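The advice about e.getCause() can be illustrated with a small JDK-only sketch. It follows the same unwrapping order as the method bodies listed under Method Detail (IOException first, then RuntimeException, then a wrapping RejectedExecutionException), but the class CauseUnwrapSketch and its two helpers are hypothetical names used only for illustration.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;

class CauseUnwrapSketch
{
    // Hypothetical helper: waits for the result, and re-throws the task's own exception.
    static <T> T getOrRethrow(Future<T> future) throws IOException, InterruptedException
    {
        try
            { return future.get(); }

        catch (ExecutionException e)
        {
            Throwable cause = e.getCause();

            if (cause instanceof IOException)       throw (IOException) cause;
            if (cause instanceof RuntimeException)  throw (RuntimeException) cause;

            // Anything else is wrapped, so the caller must inspect getCause().
            throw new RejectedExecutionException("neither IOException nor RuntimeException", cause);
        }
    }

    // Hypothetical helper: reports, as text, how a task ended.
    static String describe(Callable<?> task)
    {
        ExecutorService pool = Executors.newSingleThreadExecutor();

        try
        {
            getOrRethrow(pool.submit(task));
            return "completed";
        }

        catch (RejectedExecutionException e)
            { return "rejected, cause: " + String.valueOf(e.getCause()); }

        catch (IOException e)
            { return "IOException: " + e.getMessage(); }

        catch (RuntimeException e)
            { return "RuntimeException: " + e.getMessage(); }

        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
            return "interrupted";
        }

        finally { pool.shutdownNow(); }
    }

    public static void main(String[] args)
    {
        System.out.println(describe(() -> "fine"));
        System.out.println(describe(() -> { throw new IOException("connection reset"); }));
        System.out.println(describe(() -> { throw new Exception("something unusual"); }));
    }
}
```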
Timeouts:
If a timeout occurs (the maximum wait time has been exceeded), this class will not actually throw a java.util.concurrent.TimeoutException. Instead, the code in this class catches that exception and simply returns null as the result of the method.
Note that when there is nothing to return, the methods in this class return an empty Vector; they never return null except in the case where a timeout has, indeed, occurred.
Halting the Monitor-Thread:
Because this class runs a time-out check on a monitor-thread, a program that uses it might not exit immediately when it is ready to terminate. Java's Executors class builds a Thread Pool (and a time-out Thread). This Thread stays alive (but unused) most of the time.
If you have used this class, make sure to call the following method before your program completes, or you may find the Java Virtual Machine idly waiting for up to 30 seconds before dying and relinquishing control back to your operating-system.
// Call this before your program terminates!
// Otherwise your program may HANG-IDLE for up to 30 seconds when terminating,
// before the JRE finally kills the monitor-thread.
HTMLPageMWT.shutdownMWTThreads();
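The reason an explicit shutdown helps can be shown with a plain JDK thread pool: idle worker threads of a cached pool linger for a keep-alive period, and shutting the pool down ends them right away. The sketch below is an assumption about the general mechanism, not the body of shutdownMWTThreads(); the names PoolShutdownSketch and shutDownAndWait are hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class PoolShutdownSketch
{
    // Hypothetical helper: stops the pool, then waits for its worker threads to end.
    // Returns true if the pool terminated within the given time.
    static boolean shutDownAndWait(ExecutorService pool, long timeout, TimeUnit unit)
    {
        pool.shutdownNow();

        try
            { return pool.awaitTermination(timeout, unit); }

        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) throws Exception
    {
        ExecutorService pool = Executors.newCachedThreadPool();

        // After this task finishes, its worker thread stays alive, idle, in the pool.
        Callable<String> task = () -> "done";
        pool.submit(task).get();

        // Without this call, the idle worker would delay the exit of the JVM.
        System.out.println("terminated = " + shutDownAndWait(pool, 5, TimeUnit.SECONDS));
    }
}
```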
Class HTMLPage COMMENTS BELOW:
The purpose of this class is just to parse page tokens from raw-HTML to vectorized-pages. What is returned is a Vector<HTMLNode> which holds the contents of a web-page retrieved from a website, or from an internally stored text-String that contains HTML/text.
Method Parameters
Parameter Explanation
URL url
This is the url of a web-page containing text/html data. Class HTMLPage will connect to this url, download the byte-stream, and then parse it as an HTML page.
CharSequence html
If HTML has already been retrieved and stored locally, this html-data may be passed to this class by encapsulating the locally stored HTML-text inside a StringBuilder, StringBuffer, or just an ordinary String. This "CharSequence" will be "queried/parsed" as if the data were being retrieved from a live-webserver generating HTML. When this parameter is used, no outgoing webserver connections will be made at all. Instead, this character-sequence (most often a java String) will be treated as if it were a web-server.
BufferedReader br
There are occasions when a web-server expects or requires a "specialized connection," like ISO-8859 for instance. Sometimes a server will expect the connection to explicitly request that UTF-8 chars be sent/retrieved. When this is the case, a programmer may make such a specialized connection using the Scrape.openConn(...) methods, or make his own connection. So long as he can provide a valid java BufferedReader that returns the HTML, class HTMLPage will parse that HTML and generate a vectorized-webpage of nodes.
boolean eliminateHTMLTags
When this is TRUE, only textual HTML data will be included in the returned Vector<HTMLNode>. Specifically, all TagNode elements will be left out of the Vector (they are not instantiated by the parser), and only the TextNode elements, holding any/all textual-data found on the web-page, will be returned. The return type could as well be Vector<TextNode>; however, this is not possible because java does not allow methods to alter their return type very easily.
NOTE: When this parameter is set to TRUE, the vectorized-webpage that is returned would be identical to one returned from a call to method Util.Remove.allTagNodes(page) (where 'page' is a Vector retrieved from the exact-same web-address).
int startLineNum
This parameter will be used with class/method Scrape.getHTML(BufferedReader br, int startLineNum, int endLineNum). There, it is explained how to reduce a page-download to the content found between two line-numbers (a start and an end line-number). The purpose is to make searching the generated vectorized-page a little easier. Sometimes excessive header information is useless, and can be discarded immediately.
NOTE: If parameter startLineNum is 0 or 1, then the parse will begin from the top/start of the webpage.
EXCEPTIONS: See class Scrape, method StringBuffer getHTML(...), for more information regarding what would cause invalid line numbers to generate exception throws.
int endLineNum
Same as above, but this parameter is passed to int 'endLineNum' inside method Scrape.getHTML(BufferedReader br, int startLineNum, int endLineNum)
NOTE: If parameter endLineNum is negative, then the HTML data will be read and parsed until EOF is encountered.
EXCEPTIONS: See class Scrape, method StringBuffer getHTML(...), for more information regarding what would cause invalid line numbers to generate exception throws.
String startTag
Same as above, but this parameter is passed to String 'startTag' inside method Scrape.getHTML(BufferedReader br, String startTag, String endTag)
EXCEPTIONS: See class Scrape, method StringBuffer getHTML(...), for more information regarding what would cause invalid start and end tags to generate exception throws.
String endTag
Same as above, but this parameter is passed to String 'endTag' inside method Scrape.getHTML(BufferedReader br, String startTag, String endTag)
EXCEPTIONS: (Again) See class Scrape, method StringBuffer getHTML(...), for more information regarding what would cause invalid start and end tags to generate exception throws.
String rawHTMLFile
When this parameter is included in the method-signature parameter list, all HTML retrieved from the web-server will be copied/dumped directly to a flat-file on the file-system named by this String 'rawHTMLFile.'
NOTE: For any of the three file-name parameters (rawHTMLFile, matchesFile and justTextFile), if a value of 'null' is passed as the file-name, that set of data will not be retrieved and a file by that name will not be saved. This can be useful, for example, when only the regex data needs to be reviewed, but not the raw-HTML page-data.
String matchesFile
When this parameter is included, all regular-expression matcher information that is generated by the parser will be copied/sent to a flat-file on the file-system with this name 'matchesFile.' This data may be used for debugging code. Generally, this information is not very useful, except for understanding regex. It is, however, kept available in these methods for legacy purposes. The earliest debugging of these scrape-package classes used these flat-files quite frequently for testing.
String justTextFile
When this parameter is included in the method-signature parameter list, all TextNode elements that are generated by the parser will be copied/dumped directly to a flat-file with the name in String 'justTextFile.' This data may be used for quickly scanning the content of a webpage, but generally is not very useful. It is kept here for legacy purposes; the earliest debugging of these scrape-package classes used these flat-files quite frequently for testing.
Return Values:
All methods return a Vector<HTMLNode>, which represents a vectorized-HTML page whose elements are the parsed content of the web-page that served as input to the getPageTokens(...) method that you selected.
- See Also:
Scrape.getHTML(BufferedReader, int, int), Scrape.getHTML(BufferedReader, String, String), HTMLPage
Hi-Lited Source-Code:
This File's Source Code: Torello/HTML/HTMLPageMWT.java
File Size: 24,669 Bytes Line Count: 531 '\n' Characters Found
HTML Regular-Expression Parser Class: ../Parse and Scrape/parser/ParserRE.java
File Size: 2,612 Bytes Line Count: 73 '\n' Characters Found
HTML Parser, Inner-Loop Class: ../Parse and Scrape/parser/ParserREInternal.java
File Size: 9,775 Bytes Line Count: 216 '\n' Characters Found
HTML Regular-Expressions Class: ../Parse and Scrape/parser/HTMLRegEx.java
File Size: 1,809 Bytes Line Count: 27 '\n' Characters Found
Stateless Class:
This class neither contains any program-state, nor can it be instantiated. The @StaticFunctional Annotation may also be called 'The Spaghetti Report'. Static-Functional classes are, essentially, C-Styled Files, without any public constructors or non-static member fields. It is a concept very similar to the Java-Bean's @Stateless Annotation.
- 1 Constructor(s), 1 declared private, zero-argument constructor
- 13 Method(s), 13 declared static
- 3 Field(s), 3 declared static, 2 declared final
- Fields excused from final modifier (with explanation):
Field 'parser' is not final. Reason: SINGLETON
Field Summary
Fields (Modifier and Type, Field)
- static HTMLPage.Parser parser
Method Summary
Standard Parse (Modifier and Type, Method)
- static Vector<HTMLNode> getPageTokens(long timeout, TimeUnit unit, BufferedReader br, boolean eliminateHTMLTags)
- static Vector<HTMLNode> getPageTokens(long timeout, TimeUnit unit, URL url, boolean eliminateHTMLTags)
- static void shutdownMWTThreads()
Standard Parse, w/ Debug Files (Modifier and Type, Method)
- static Vector<HTMLNode> getPageTokens(long timeout, TimeUnit unit, BufferedReader br, boolean eliminateHTMLTags, String rawHTMLFile, String matchesFile, String justTextFile)
- static Vector<HTMLNode> getPageTokens(long timeout, TimeUnit unit, URL url, boolean eliminateHTMLTags, String rawHTMLFile, String matchesFile, String justTextFile)
Page Limited: Line-Number (Modifier and Type, Method)
- static Vector<HTMLNode> getPageTokens(long timeout, TimeUnit unit, BufferedReader br, boolean eliminateHTMLTags, int startLineNum, int endLineNum)
- static Vector<HTMLNode> getPageTokens(long timeout, TimeUnit unit, URL url, boolean eliminateHTMLTags, int startLineNum, int endLineNum)
Page Limited: Line-Number, w/ Debug Files (Modifier and Type, Method)
- static Vector<HTMLNode> getPageTokens(long timeout, TimeUnit unit, BufferedReader br, boolean eliminateHTMLTags, int startLineNum, int endLineNum, String rawHTMLFile, String matchesFile, String justTextFile)
- static Vector<HTMLNode> getPageTokens(long timeout, TimeUnit unit, URL url, boolean eliminateHTMLTags, int startLineNum, int endLineNum, String rawHTMLFile, String matchesFile, String justTextFile)
Page Limited: Start & End Substring (Modifier and Type, Method)
- static Vector<HTMLNode> getPageTokens(long timeout, TimeUnit unit, BufferedReader br, boolean eliminateHTMLTags, String startTag, String endTag)
- static Vector<HTMLNode> getPageTokens(long timeout, TimeUnit unit, URL url, boolean eliminateHTMLTags, String startTag, String endTag)
Page Limited: Start & End String, w/ Debug Files (Modifier and Type, Method)
- static Vector<HTMLNode> getPageTokens(long timeout, TimeUnit unit, BufferedReader br, boolean eliminateHTMLTags, String startTag, String endTag, String rawHTMLFile, String matchesFile, String justTextFile)
- static Vector<HTMLNode> getPageTokens(long timeout, TimeUnit unit, URL url, boolean eliminateHTMLTags, String startTag, String endTag, String rawHTMLFile, String matchesFile, String justTextFile)
Field Detail
-
parser
public static HTMLPage.Parser parser
If needing to "swap a proprietary parser" comes up, this is possible. The replacement just needs to accept the same parameters as the current parser, and produce a Vector<HTMLNode>. This is not an advised step to take, but if an alternative parser has been tested and happens to be generating different results, it can be easily 'swapped out' for the one used now.
- See Also:
HTMLPage.Parser, HTMLPage.Parser.parse(CharSequence, boolean, String, String, String)
- Code:
- Exact Field Declaration Expression:
public static Parser parser = ParserRE::parsePageTokens;
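Swapping the parser amounts to assigning a different method reference to a static field whose type is a functional interface. The sketch below shows only that mechanism. Its Parser interface is a hypothetical stand-in that copies the parameter list of parse(CharSequence, boolean, String, String, String), but returns Vector<String> so that the example needs nothing outside the JDK. The two parse methods are placeholders, not real parsers.

```java
import java.util.Vector;

class ParserSwapSketch
{
    // Hypothetical stand-in for HTMLPage.Parser (same parameters, simplified return type).
    @FunctionalInterface
    interface Parser
    {
        Vector<String> parse(
            CharSequence html, boolean eliminateHTMLTags,
            String rawHTMLFile, String matchesFile, String justTextFile
        );
    }

    // Placeholder for the parser that is installed by default.
    static Vector<String> defaultParse(
            CharSequence html, boolean eliminateHTMLTags,
            String rawHTMLFile, String matchesFile, String justTextFile
        )
    {
        Vector<String> ret = new Vector<>();
        ret.add("default:" + html);
        return ret;
    }

    // Placeholder for an alternative parser that accepts the same parameters.
    static Vector<String> alternativeParse(
            CharSequence html, boolean eliminateHTMLTags,
            String rawHTMLFile, String matchesFile, String justTextFile
        )
    {
        Vector<String> ret = new Vector<>();
        ret.add("alternative:" + html);
        return ret;
    }

    // A non-final static field, so that it can be re-assigned.
    static Parser parser = ParserSwapSketch::defaultParse;

    public static void main(String[] args)
    {
        System.out.println(parser.parse("<p>Hi</p>", false, null, null, null));

        // The swap: every later call goes through the alternative parser.
        parser = ParserSwapSketch::alternativeParse;
        System.out.println(parser.parse("<p>Hi</p>", false, null, null, null));
    }
}
```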
-
-
Method Detail
-
getPageTokens
public static java.util.Vector<HTMLNode> getPageTokens (long timeout, java.util.concurrent.TimeUnit unit, java.net.URL url, boolean eliminateHTMLTags) throws java.io.IOException, java.lang.InterruptedException
Convenience Method
Accepts: URL and Time-Out Parameters 'timeout' & 'unit'
Passes null to parameters startTag, endTag, rawHTMLFile, matchesFile & justTextFile.
Invokes: getPageTokens(long, TimeUnit, URL, boolean, String, String, String, String, String)
- Code:
- Exact Method Body:
return getPageTokens(timeout, unit, url, eliminateHTMLTags, null, null, null, null, null);
-
getPageTokens
public static java.util.Vector<HTMLNode> getPageTokens (long timeout, java.util.concurrent.TimeUnit unit, java.net.URL url, boolean eliminateHTMLTags, java.lang.String startTag, java.lang.String endTag) throws java.io.IOException, java.lang.InterruptedException
Convenience Method
Accepts: URL and Time-Out Parameters 'timeout' & 'unit'
And-Accepts: 'startTag' and 'endTag'
Passes null to parameters rawHTMLFile, matchesFile & justTextFile.
Invokes: getPageTokens(long, TimeUnit, URL, boolean, String, String, String, String, String)
- Code:
- Exact Method Body:
return getPageTokens (timeout, unit, url, eliminateHTMLTags, startTag, endTag, null, null, null);
-
getPageTokens
public static java.util.Vector<HTMLNode> getPageTokens (long timeout, java.util.concurrent.TimeUnit unit, java.net.URL url, boolean eliminateHTMLTags, int startLineNum, int endLineNum) throws java.io.IOException, java.lang.InterruptedException
Convenience Method
Accepts: URL and Time-Out Parameters 'timeout' & 'unit'
And-Accepts: 'startLineNum' and 'endLineNum'
Passes null to parameters rawHTMLFile, matchesFile & justTextFile.
Invokes: getPageTokens(long, TimeUnit, URL, boolean, int, int, String, String, String)
- Code:
- Exact Method Body:
return getPageTokens (timeout, unit, url, eliminateHTMLTags, startLineNum, endLineNum, null, null, null);
-
getPageTokens
public static java.util.Vector<HTMLNode> getPageTokens (long timeout, java.util.concurrent.TimeUnit unit, java.net.URL url, boolean eliminateHTMLTags, java.lang.String rawHTMLFile, java.lang.String matchesFile, java.lang.String justTextFile) throws java.io.IOException, java.lang.InterruptedException
Convenience Method
Accepts: URL and Time-Out Parameters 'timeout' & 'unit'
Passes null to the startTag & endTag parameters.
Invokes: getPageTokens(long, TimeUnit, URL, boolean, String, String, String, String, String)
- Code:
- Exact Method Body:
return getPageTokens( timeout, unit, url, eliminateHTMLTags, null, null, rawHTMLFile, matchesFile, justTextFile );
-
getPageTokens
public static java.util.Vector<HTMLNode> getPageTokens (long timeout, java.util.concurrent.TimeUnit unit, java.io.BufferedReader br, boolean eliminateHTMLTags) throws java.io.IOException, java.lang.InterruptedException
Convenience Method
Accepts: BufferedReader
Passes null to parameters startTag, endTag, rawHTMLFile, matchesFile & justTextFile.
Invokes: getPageTokens(long, TimeUnit, BufferedReader, boolean, String, String, String, String, String)
- Code:
- Exact Method Body:
return getPageTokens (timeout, unit, br, eliminateHTMLTags, null, null, null, null, null);
-
getPageTokens
public static java.util.Vector<HTMLNode> getPageTokens (long timeout, java.util.concurrent.TimeUnit unit, java.io.BufferedReader br, boolean eliminateHTMLTags, java.lang.String startTag, java.lang.String endTag) throws java.io.IOException, java.lang.InterruptedException
Convenience Method
Accepts: BufferedReader
And-Accepts: 'startTag' and 'endTag'
Passes null to parameters rawHTMLFile, matchesFile & justTextFile.
Invokes: getPageTokens(long, TimeUnit, BufferedReader, boolean, String, String, String, String, String)
- Code:
- Exact Method Body:
return getPageTokens (timeout, unit, br, eliminateHTMLTags, startTag, endTag, null, null, null);
-
getPageTokens
public static java.util.Vector<HTMLNode> getPageTokens (long timeout, java.util.concurrent.TimeUnit unit, java.io.BufferedReader br, boolean eliminateHTMLTags, int startLineNum, int endLineNum) throws java.io.IOException, java.lang.InterruptedException
Convenience Method
Accepts: BufferedReader
And-Accepts: 'startLineNum' and 'endLineNum'
Passes null to parameters rawHTMLFile, matchesFile & justTextFile.
Invokes: getPageTokens(long, TimeUnit, BufferedReader, boolean, int, int, String, String, String)
- Code:
- Exact Method Body:
return getPageTokens (timeout, unit, br, eliminateHTMLTags, startLineNum, endLineNum, null, null, null);
-
getPageTokens
public static java.util.Vector<HTMLNode> getPageTokens (long timeout, java.util.concurrent.TimeUnit unit, java.io.BufferedReader br, boolean eliminateHTMLTags, java.lang.String rawHTMLFile, java.lang.String matchesFile, java.lang.String justTextFile) throws java.io.IOException, java.lang.InterruptedException
Convenience Method
Accepts: BufferedReader
Passes null to the startTag & endTag parameters.
Invokes: getPageTokens(long, TimeUnit, BufferedReader, boolean, String, String, String, String, String)
- Code:
- Exact Method Body:
return getPageTokens (timeout, unit, br, eliminateHTMLTags, null, null, rawHTMLFile, matchesFile, justTextFile);
-
shutdownMWTThreads
public static void shutdownMWTThreads()
If this class has been used to make "multi-threaded" calls that use a Time-Out wait-period, you might see your Java program hang for a few seconds when you would expect it to exit back to your O.S. normally.
Max Wait Time operates by building a "Timeout & Monitor" thread. Therefore, if a program you have written has performed any Internet-Downloads using class HTMLPageMWT, then when it reaches the end of its code it might not exit immediately, but rather sit at the command-prompt for anywhere between 10 and 30 seconds before the Timeout-Thread created in class HTMLPageMWT dies.
Multi-Threaded:
This method may also be used to immediately terminate any additional threads that were started.
-
getPageTokens
public static java.util.Vector<HTMLNode> getPageTokens (long timeout, java.util.concurrent.TimeUnit unit, java.io.BufferedReader br, boolean eliminateHTMLTags, java.lang.String startTag, java.lang.String endTag, java.lang.String rawHTMLFile, java.lang.String matchesFile, java.lang.String justTextFile) throws java.io.IOException, java.lang.InterruptedException
Parses and Vectorizes HTML from a BufferedReader source. Spawns a monitor-thread that stops the download if a certain, user-specified, time-limit is exceeded.
- Parameters:
timeout - This is the amount of time the program will wait for web-content to download, before cutting the connection and returning null. If null is returned, the connection 'timed-out' according to this specified timeout duration.
unit - The value passed to parameter 'timeout' is measured in units of time using java class java.util.concurrent.TimeUnit.
br - This BufferedReader will be scanned, and the HTML saved to a String. Then it is parsed into HTMLNode's and returned as an HTML Vector.
eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
startTag - If this parameter is non-null, the scrape-logic will skip all content before finding the substring 'startTag'. Parsing HTML will not begin until this token is identified somewhere in the input-source.
endTag - If this parameter is non-null, the scrape-logic will skip all content after the substring 'endTag' is identified in the input-source.
rawHTMLFile - If this parameter is non-null, an identical copy of the HTML that is retrieved will be saved (as a text-file) to the file named by parameter 'rawHTMLFile'. If this parameter is null, it will be ignored (and the raw-HTML discarded).
matchesFile - If this parameter is non-null, a parser-output file, consisting of the regular-expression matches obtained while parsing the HTML, will be saved to disk using this file-name. This is a legacy feature, which can be helpful when debugging and investigating the contents of output HTML-Vector's. This parameter may be null, and if it is, Regular-Expression Match Data will simply be discarded by the parser, after use.
justTextFile - If this parameter is non-null, a copy of each and every character of text found on the downloaded web-page that is not inside of an HTML TagNode or CommentNode will be saved to disk using this file-name. This is also a legacy feature. The text-file generated makes it easy to quickly scan the words that would be displayed on the page. If this parameter is null, it will be ignored.
- Returns:
A Vector of HTMLNode's (called 'Vectorized HTML') that represents the available parsed-content provided by the input-source.
- Throws:
ScrapeException - If either startTag or endTag is non-null, but not found on the input-page.
java.io.IOException - This exception throws if there are any problems while processing the input-source HTML content (or writing output, if any).
java.lang.InterruptedException - This exception throws if the web-page download Thread is interrupted while downloading. Note that this, like IOException, is a checked exception, and must be caught.
java.util.concurrent.RejectedExecutionException - This is thrown if the java Thread processing system fails to run the download Thread, or the monitor Thread. This is an unchecked RuntimeException.
- Code:
- Exact Method Body:
    Callable<Vector<HTMLNode>> threadDownloader = new Callable<Vector<HTMLNode>>()
    {
        public Vector<HTMLNode> call() throws Exception
        {
            return parser.parse(
                Scrape.getHTML(br, startTag, endTag), eliminateHTMLTags,
                rawHTMLFile, matchesFile, justTextFile
            );
        }
    };

    lock.lock();
    Future<Vector<HTMLNode>> future = executor.submit(threadDownloader);
    lock.unlock();

    try
        { return future.get(timeout, unit); }

    catch (TimeoutException e)
        { return null; }

    catch (ExecutionException e)
    {
        Throwable originalException = e.getCause();

        if (originalException == null) throw new RejectedExecutionException(
            "An Execution Exception was thrown, but it did provide a cause throwable " +
            "(e.getCause() returned null). See this exception's getCause() method to " +
            "view the ExecutionException that has occurred.",
            e
        );

        if (originalException instanceof IOException)
            throw (IOException) originalException;

        if (originalException instanceof RuntimeException)
            throw (RuntimeException) originalException;

        throw new RejectedExecutionException(
            "An Execution Exception occurred, but it was neither a RuntimeException, " +
            "nor IOException. See this exception's getCause() method to view the " +
            "underlying error that has occurred.",
            originalException
        );
    }
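The effect of 'startTag' and 'endTag' on the input can be sketched as a substring operation. This is a hedged illustration, not the code of Scrape.getHTML: the class TagLimitSketch and the method limit are hypothetical, the sketch assumes that both tags themselves are kept in the result, and it throws IllegalArgumentException where the real method is documented to throw ScrapeException.

```java
class TagLimitSketch
{
    // Hypothetical helper: keeps only the text from 'startTag' through 'endTag'.
    // A null tag means "no limit" on that side of the page.
    static String limit(String html, String startTag, String endTag)
    {
        int start = 0;

        if (startTag != null)
        {
            start = html.indexOf(startTag);

            if (start < 0)
                throw new IllegalArgumentException("startTag not found: " + startTag);
        }

        int end = html.length();

        if (endTag != null)
        {
            // Search for the end tag only after the position where the content begins.
            int pos = html.indexOf(endTag, start);

            if (pos < 0)
                throw new IllegalArgumentException("endTag not found: " + endTag);

            end = pos + endTag.length();
        }

        return html.substring(start, end);
    }

    public static void main(String[] args)
    {
        String page = "<html><body><p>Hi</p></body></html>";

        System.out.println(limit(page, "<body>", "</body>"));
        System.out.println(limit(page, null, null));
    }
}
```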
-
getPageTokens
public static java.util.Vector<HTMLNode> getPageTokens (long timeout, java.util.concurrent.TimeUnit unit, java.io.BufferedReader br, boolean eliminateHTMLTags, int startLineNum, int endLineNum, java.lang.String rawHTMLFile, java.lang.String matchesFile, java.lang.String justTextFile) throws java.io.IOException, java.lang.InterruptedException
Parses and Vectorizes HTML from a BufferedReader source. Spawns a monitor-thread that stops the download if a certain, user-specified, time-limit is exceeded.
- Parameters:
timeout - This is the amount of time the program will wait for web-content to download, before cutting the connection and returning null. If null is returned, the connection 'timed-out' according to this specified timeout duration.
unit - The value passed to parameter 'timeout' is measured in units of time using java class java.util.concurrent.TimeUnit.
br - This BufferedReader will be scanned, and the HTML saved to a String. Then it is parsed into HTMLNode's and returned as an HTML Vector.
eliminateHTMLTags - When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector having only the page-text (as instances of TextNode) is returned, instead.
startLineNum - This parameter allows a programmer to prevent any content on the page (or sub-page) whose line-number is before 'startLineNum' from being retrieved (or parsed) into the return-Vector.
endLineNum - This parameter allows a programmer to prevent any content on the web-page (or sub-page) whose line-number is after 'endLineNum' from being retrieved (or parsed) into the return-Vector.
rawHTMLFile - If this parameter is non-null, an identical copy of the HTML that is retrieved will be saved (as a text-file) to the file named by parameter 'rawHTMLFile'. If this parameter is null, it will be ignored (and the raw-HTML discarded).
matchesFile - If this parameter is non-null, a parser-output file, consisting of the regular-expression matches obtained while parsing the HTML, will be saved to disk using this file-name. This is a legacy feature, which can be helpful when debugging and investigating the contents of output HTML-Vector's. This parameter may be null, and if it is, Regular-Expression Match Data will simply be discarded by the parser, after use.
justTextFile - If this parameter is non-null, a copy of each and every character of text found on the downloaded web-page that is not inside of an HTML TagNode or CommentNode will be saved to disk using this file-name. This is also a legacy feature. The text-file generated makes it easy to quickly scan the words that would be displayed on the page. If this parameter is null, it will be ignored.
- Returns:
A Vector of HTMLNode's (called 'Vectorized HTML') that represents the available parsed-content provided by the input-source.
- Throws:
java.lang.IllegalArgumentException - If parameter startLineNum is negative, or endLineNum is before startLineNum.
ScrapeException - If either startLineNum or endLineNum is greater than the number of lines on the web-page (or sub-page).
java.io.IOException - This exception throws if there are any problems while processing the input-source HTML content (or writing output, if any).
java.lang.InterruptedException - This exception throws if the web-page download Thread is interrupted while downloading. Note that this, like IOException, is a checked exception, and must be caught.
java.util.concurrent.RejectedExecutionException - This is thrown if the java Thread processing system fails to run the download Thread, or the monitor Thread. This is an unchecked RuntimeException.
- Code:
- Exact Method Body:
    Callable<Vector<HTMLNode>> threadDownloader = new Callable<Vector<HTMLNode>>()
    {
        public Vector<HTMLNode> call() throws Exception
        {
            return parser.parse(
                Scrape.getHTML(br, startLineNum, endLineNum), eliminateHTMLTags,
                rawHTMLFile, matchesFile, justTextFile
            );
        }
    };

    lock.lock();
    Future<Vector<HTMLNode>> future = executor.submit(threadDownloader);
    lock.unlock();

    try
        { return future.get(timeout, unit); }

    catch (TimeoutException e)
        { return null; }

    catch (ExecutionException e)
    {
        Throwable originalException = e.getCause();

        if (originalException == null) throw new RejectedExecutionException(
            "An Execution Exception was thrown, but it did provide a cause throwable " +
            "(e.getCause() returned null). See this exception's getCause() method to " +
            "view the ExecutionException has that occurred.",
            e
        );

        if (originalException instanceof IOException)
            throw (IOException) originalException;

        if (originalException instanceof RuntimeException)
            throw (RuntimeException) originalException;

        throw new RejectedExecutionException(
            "An Execution Exception occurred, but it was neither a RuntimeException, nor " +
            "IOException. See this exception's getCause() method to view the underlying " +
            "error that has occurred.",
            originalException
        );
    }
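The line-number limits can be sketched in the same spirit. The helper below is hypothetical and is not Scrape.getHTML. It assumes 1-based, inclusive line numbers, treats a startLineNum of 0 or 1 as "from the top" and a negative endLineNum as "until EOF", as the parameter notes on this page describe. The check for line numbers beyond the end of the page is omitted.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class LineLimitSketch
{
    // Hypothetical helper: returns the lines numbered startLineNum..endLineNum (inclusive).
    static String linesBetween(String text, int startLineNum, int endLineNum)
    {
        if (startLineNum < 0)
            throw new IllegalArgumentException("startLineNum is negative");

        if ((endLineNum >= 0) && (endLineNum < startLineNum))
            throw new IllegalArgumentException("endLineNum is before startLineNum");

        StringBuilder sb = new StringBuilder();

        try (BufferedReader br = new BufferedReader(new StringReader(text)))
        {
            String line;

            for (int lineNum = 1; (line = br.readLine()) != null; lineNum++)
            {
                // Skip the lines before the requested start.
                if (lineNum < startLineNum) continue;

                // Stop reading once the requested end has been passed.
                if ((endLineNum >= 0) && (lineNum > endLineNum)) break;

                sb.append(line).append('\n');
            }
        }

        catch (IOException e)
            { throw new UncheckedIOException(e); }

        return sb.toString();
    }

    public static void main(String[] args)
    {
        String page = "<html>\n<head></head>\n<body>\n<p>Hi</p>\n</body>\n</html>\n";

        // Keeps only lines 3 through 5 of the page.
        System.out.print(linesBetween(page, 3, 5));
    }
}
```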
-
getPageTokens
public static java.util.Vector<HTMLNode> getPageTokens (long timeout, java.util.concurrent.TimeUnit unit, java.net.URL url, boolean eliminateHTMLTags, java.lang.String startTag, java.lang.String endTag, java.lang.String rawHTMLFile, java.lang.String matchesFile, java.lang.String justTextFile) throws java.io.IOException, java.lang.InterruptedException
Parses and vectorizes HTML from a URL source. Spawns a monitor thread that stops the download if a user-specified time limit is exceeded.
- Parameters:
timeout
- The amount of time the program will wait for web content to download before cutting the connection and returning null. A null return value means the connection timed out according to this duration.
unit
- The unit of time, an instance of java.util.concurrent.TimeUnit, in which the value passed to parameter 'timeout' is measured.
url
- This URL will be scraped, and the HTML saved to a String. Afterwards, the String will be parsed into an HTML Vector and returned.
eliminateHTMLTags
- When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector containing only the page text (as instances of TextNode) is returned instead.
startTag
- If this parameter is non-null, the scrape logic will skip all content that precedes the substring 'startTag'. Parsing will not begin until this token is found somewhere in the input source.
endTag
- If this parameter is non-null, the scrape logic will skip all content that follows the substring 'endTag' in the input source.
rawHTMLFile
- If this parameter is non-null, an identical copy of the retrieved HTML will be saved (as a text file) to the file named by parameter 'rawHTMLFile'. If this parameter is null, it is ignored and the raw HTML is discarded.
matchesFile
- If this parameter is non-null, a parser-output file, consisting of the regular-expression matches obtained while parsing the HTML, will be saved to disk using this file name. This is a legacy feature, which can be helpful when debugging and investigating the contents of output HTML Vectors. If this parameter is null, the regular-expression match data is simply discarded by the parser after use.
justTextFile
- If this parameter is non-null, a copy of every character of text found on the downloaded web page that is not inside an HTML TagNode or CommentNode will be saved to disk using this file name. This is also a legacy feature. The generated text file makes it easy to quickly scan the words that would be displayed on the page. If this parameter is null, it is ignored.
- Returns:
- A Vector of HTMLNode instances (called 'Vectorized HTML') that represents the parsed content available from the input source, or null if the timeout expires first.
- Throws:
ScrapeException
- If either startTag or endTag is non-null but is not found on the input page.
java.io.IOException
- Thrown if there are any problems while processing the input-source HTML content (or writing output, if any).
java.lang.InterruptedException
- Thrown if the web-page download Thread is interrupted while downloading. Note that this, like IOException, is a checked exception and must be caught or declared.
java.util.concurrent.RejectedExecutionException
- Thrown if the Java Thread processing system fails to run the download Thread or the monitor Thread. This is an unchecked RuntimeException.
- Code:
- Exact Method Body:
Callable<Vector<HTMLNode>> threadDownloader = new Callable<Vector<HTMLNode>>()
{
    public Vector<HTMLNode> call() throws Exception
    {
        return parser.parse(
            Scrape.getHTML(Scrape.openConn(url), startTag, endTag),
            eliminateHTMLTags, rawHTMLFile, matchesFile, justTextFile
        );
    }
};

lock.lock();
Future<Vector<HTMLNode>> future = executor.submit(threadDownloader);
lock.unlock();

try
    { return future.get(timeout, unit); }
catch (TimeoutException e)
    { return null; }
catch (ExecutionException e)
{
    Throwable originalException = e.getCause();

    if (originalException == null) throw new RejectedExecutionException(
        "An Execution Exception was thrown, but it did not provide a cause throwable " +
        "(e.getCause() returned null). See this exception's getCause() method to " +
        "view the ExecutionException that has occurred.", e
    );

    if (originalException instanceof IOException)
        throw (IOException) originalException;

    if (originalException instanceof RuntimeException)
        throw (RuntimeException) originalException;

    throw new RejectedExecutionException(
        "An Execution Exception occurred, but it was neither a RuntimeException, " +
        "nor IOException. See this exception's getCause() method to view the " +
        "underlying error that has occurred.", originalException
    );
}
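The timeout mechanism in the method body above (submit a Callable, then wait on the Future for a bounded time) can be sketched with JDK classes alone. In the sketch below, TimeoutSketch, callWithTimeout, and the simulated "hanging" and "quick" servers are hypothetical names for illustration. One deliberate difference is labeled in the code: the sketch calls Future.cancel(true) on timeout, whereas the method body shown above returns null without cancelling.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical class and method names; only JDK types are used.
class TimeoutSketch
{
    // Daemon threads, so an abandoned task cannot keep the JVM alive.
    private static final ExecutorService executor = Executors.newCachedThreadPool(r ->
    {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    });

    // Returns the task's result, or null if it does not finish within the limit.
    static <T> T callWithTimeout(Callable<T> task, long timeout, TimeUnit unit)
        throws InterruptedException, ExecutionException
    {
        Future<T> future = executor.submit(task);

        try
            { return future.get(timeout, unit); }
        catch (TimeoutException e)
        {
            // Assumption: interrupting the worker is desired. The method body
            // shown above returns null without making this call.
            future.cancel(true);
            return null;
        }
    }

    public static void main(String[] args) throws Exception
    {
        // Stand-ins for a server that hangs and a server that answers promptly.
        Callable<String> hangingServer = () -> { Thread.sleep(5_000); return "<html>late</html>"; };
        Callable<String> quickServer   = () -> "<html>ok</html>";

        String slow = callWithTimeout(hangingServer, 200, TimeUnit.MILLISECONDS);
        String fast = callWithTimeout(quickServer, 2, TimeUnit.SECONDS);

        // A null result signals a timeout, so callers should check before use.
        System.out.println((slow == null) ? "timed out" : slow);
        System.out.println(fast);
    }
}
```

Because null is the timeout signal, callers of getPageTokens should test the returned Vector for null before iterating over it. Note that interrupting a worker thread does not necessarily abort a blocking network read, so cancellation here is best-effort.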
-
getPageTokens
public static java.util.Vector<HTMLNode> getPageTokens (long timeout, java.util.concurrent.TimeUnit unit, java.net.URL url, boolean eliminateHTMLTags, int startLineNum, int endLineNum, java.lang.String rawHTMLFile, java.lang.String matchesFile, java.lang.String justTextFile) throws java.io.IOException, java.lang.InterruptedException
Parses and vectorizes HTML from a URL source. Spawns a monitor thread that stops the download if a user-specified time limit is exceeded.
- Parameters:
timeout
- The amount of time the program will wait for web content to download before cutting the connection and returning null. A null return value means the connection timed out according to this duration.
unit
- The unit of time, an instance of java.util.concurrent.TimeUnit, in which the value passed to parameter 'timeout' is measured.
url
- This URL will be scraped, and the HTML saved to a String. Afterwards, the String will be parsed into an HTML Vector and returned.
eliminateHTMLTags
- When this parameter is TRUE, all TagNode and CommentNode elements are eliminated from the returned HTML Vector. A Vector containing only the page text (as instances of TextNode) is returned instead.
startLineNum
- Content on the page (or sub-page) whose line number is before 'startLineNum' will not be retrieved or parsed into the returned Vector.
endLineNum
- Content on the web page (or sub-page) whose line number is after 'endLineNum' will not be retrieved or parsed into the returned Vector.
rawHTMLFile
- If this parameter is non-null, an identical copy of the retrieved HTML will be saved (as a text file) to the file named by parameter 'rawHTMLFile'. If this parameter is null, it is ignored and the raw HTML is discarded.
matchesFile
- If this parameter is non-null, a parser-output file, consisting of the regular-expression matches obtained while parsing the HTML, will be saved to disk using this file name. This is a legacy feature, which can be helpful when debugging and investigating the contents of output HTML Vectors. If this parameter is null, the regular-expression match data is simply discarded by the parser after use.
justTextFile
- If this parameter is non-null, a copy of every character of text found on the downloaded web page that is not inside an HTML TagNode or CommentNode will be saved to disk using this file name. This is also a legacy feature. The generated text file makes it easy to quickly scan the words that would be displayed on the page. If this parameter is null, it is ignored.
- Returns:
- A Vector of HTMLNode instances (called 'Vectorized HTML') that represents the parsed content available from the input source, or null if the timeout expires first.
- Throws:
java.lang.IllegalArgumentException
- If parameter startLineNum is negative, or endLineNum is before startLineNum.
ScrapeException
- If either startLineNum or endLineNum is greater than the number of lines on the web page (or sub-page).
java.io.IOException
- Thrown if there are any problems while processing the input-source HTML content (or writing output, if any).
java.lang.InterruptedException
- Thrown if the web-page download Thread is interrupted while downloading. Note that this, like IOException, is a checked exception and must be caught or declared.
java.util.concurrent.RejectedExecutionException
- Thrown if the Java Thread processing system fails to run the download Thread or the monitor Thread. This is an unchecked RuntimeException.
- Code:
- Exact Method Body:
Callable<Vector<HTMLNode>> threadDownloader = new Callable<Vector<HTMLNode>>()
{
    public Vector<HTMLNode> call() throws Exception
    {
        return parser.parse(
            Scrape.getHTML(Scrape.openConn(url), startLineNum, endLineNum),
            eliminateHTMLTags, rawHTMLFile, matchesFile, justTextFile
        );
    }
};

lock.lock();
Future<Vector<HTMLNode>> future = executor.submit(threadDownloader);
lock.unlock();

try
    { return future.get(timeout, unit); }
catch (TimeoutException e)
    { return null; }
catch (ExecutionException e)
{
    Throwable originalException = e.getCause();

    if (originalException == null) throw new RejectedExecutionException(
        "An Execution Exception was thrown, but it did not provide a cause throwable " +
        "(e.getCause() returned null). See this exception's getCause() method to " +
        "view the ExecutionException that has occurred.", e
    );

    if (originalException instanceof IOException)
        throw (IOException) originalException;

    if (originalException instanceof RuntimeException)
        throw (RuntimeException) originalException;

    throw new RejectedExecutionException(
        "An Execution Exception occurred, but it was neither a RuntimeException, nor " +
        "IOException. See this exception's getCause() method to view the underlying " +
        "error that has occurred.", originalException
    );
}
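The line-range rules documented for startLineNum and endLineNum can be illustrated with a small, self-contained sketch. It is not the library's implementation, and it rests on two assumptions that the documentation does not state: line numbers are counted from 0, and both bounds are inclusive. The names LineRangeSketch and selectLines are hypothetical, and BeyondLastLineException is only a stand-in for ScrapeException.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical class and method names; only JDK types are used.
class LineRangeSketch
{
    // Stand-in for the library's ScrapeException (illustration only).
    static class BeyondLastLineException extends RuntimeException
    {
        BeyondLastLineException(String message) { super(message); }
    }

    // Keeps only the lines numbered startLineNum..endLineNum.
    // Assumptions: numbering starts at 0 and both bounds are inclusive.
    static List<String> selectLines(List<String> lines, int startLineNum, int endLineNum)
    {
        if (startLineNum < 0)
            throw new IllegalArgumentException("startLineNum is negative: " + startLineNum);

        if (endLineNum < startLineNum)
            throw new IllegalArgumentException("endLineNum is before startLineNum");

        // endLineNum >= startLineNum here, so one check covers both bounds.
        if (endLineNum >= lines.size())
            throw new BeyondLastLineException("Range extends past the last line of the page");

        return new ArrayList<>(lines.subList(startLineNum, endLineNum + 1));
    }

    public static void main(String[] args)
    {
        List<String> page = Arrays.asList(
            "<html>", "<body>", "<p>Hello</p>", "</body>", "</html>"
        );

        // Drops the first and last lines, keeping lines 1 through 3.
        for (String line : selectLines(page, 1, 3)) System.out.println(line);
    }
}
```

Validating the arguments before selecting lines mirrors the documented split between IllegalArgumentException for an impossible range and ScrapeException for a range that the page cannot satisfy.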
-
-