Package Torello.HTML.Tools.NewsSite
Interface ArticleGet
-
- All Superinterfaces:
java.io.Serializable
- Functional Interface:
- This is a functional interface and can therefore be used as the assignment target for a lambda expression or method reference.
@FunctionalInterface public interface ArticleGet extends java.io.Serializable
A function-pointer / lambda target for extracting an article's content from the web-page from whence it was downloaded; including several static builder methods for the most common means of finding the HTML-Tags that wrap article-HTML on news-media websites.
The purpose of this functional-interface is to "get" an article-content-body out of the complete HTML page in which it resides. For all intents and purposes, this is a highly trivial coding problem, but one that is different for just about every news-site on the internet.
Generally, all that is needed to implement an instance of class ArticleGet is to provide the needed parameters to one of several factory-methods in this class. This class contains several 'static' methods named 'usual(...)' that accept typical NodeSearch-Package parameters for extracting a partial Web-Page out of a complete one.
Primarily, the use of class 'ArticleGet' is such that after a list of News-Article URL's has been built from an online News-Based Web-Site, those Articles are processed quickly. This is accomplished by immediately removing all extraneous HTML and concentrating only on the Article and Header itself.
For example, on Yahoo! News, when downloading any one of the myriad Yahoo! Articles, one will encounter lists upon lists of "related news", advertisements, links to other sections of the site, and even User-Comments. The Article-Body itself - usually including the Title, Author and Story-Photos - is easily retrieved by looking for the HTML Tag "<ARTICLE ...>".
To retrieve the contents of the <ARTICLE> ... </ARTICLE> construct, simply make a call to the NodeSearch-Package method TagNodeGetInclusive.first(fullPage, "article"). It will retrieve the entire Article Content in a single line of code!
Below are some examples of how to build an instance of 'ArticleGet' such that it may be passed to the method ScrapeArticles.download(...). Generally, this automates the sometimes laborious process of scraping an entire News Web-Site for a day's entire set of articles.
Example:
// This example would take a page copied from a URL on a news-site, and eliminate
// everything except the HTMLNode's that were between the DIV whose class attribute is:
// <DIV ... class="body-content"> article ... [HTMLNodes] ... </DIV>

// This uses java's lambda syntax to build the ArticleGet instance
ArticleGet ag = (URL url, Vector<HTMLNode> page) ->
    InnerTagGetInclusive.first(page, "div", "class", TextComparitor.C, "body-content");

// The behaviour of this ArticleGetter will be identical to the one, manually built, above.
// Here, a pre-defined "factory builder" method is used instead:
ArticleGet ag2 = ArticleGet.usual("div", "class", TextComparitor.C, "body-content");
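The single-parameter factory-method covers the <ARTICLE>-wrapped case described above. The sketch below assumes the target site actually wraps its story in <ARTICLE> (or <MAIN>, or <SECTION>), which has to be confirmed by inspecting the page source.

// An ArticleGet for sites that wrap the story-body in:  <ARTICLE> ... </ARTICLE>
ArticleGet ag3 = ArticleGet.usual("article");

// The equivalent, manually-built lambda, using the NodeSearch package directly:
ArticleGet ag4 = (URL url, Vector<HTMLNode> page) ->
    TagNodeGetInclusive.first(page, "article");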
Hi-Lited Source-Code: Torello/HTML/Tools/NewsSite/ArticleGet.java
File Size: 46,714 Bytes, Line Count: 930
-
-
Field Summary
Serializable ID
Modifier and Type        Field
static long              serialVersionUID
-
Method Summary
@FunctionalInterface: (Lambda) Method
Modifier and Type        Method
Vector<HTMLNode>         apply(URL url, Vector<HTMLNode> page)

Methods: Static Factory-Builder, Typical
Modifier and Type        Method
static ArticleGet        usual(String htmlTag)
static ArticleGet        usual(String startTextTag, String endTextTag)
static ArticleGet        usual(String htmlTag, String innerTag, Predicate<String> p)
static ArticleGet        usual(String htmlTag, String innerTag, Pattern innerTagValuePattern)
static ArticleGet        usual(String htmlTag, String innerTag, TextComparitor tc, String... attributeValueCompareStrings)
static ArticleGet        usual(Pattern startPattern, Pattern endPattern)
static ArticleGet        usual(TextComparitor tc, String... cssClassCompareStrings)

Methods: Static Factory-Builder, Additional
Modifier and Type        Method
static ArticleGet        branch(URLFilter[] urlSelectors, ArticleGet[] getters)
static ArticleGet        identity()

Methods: Default Composition & Builder
Modifier and Type        Method
default ArticleGet       andThen(ArticleGet after)
default ArticleGet       compose(ArticleGet before)

Internal Helper
Modifier and Type        Method
static String            STR_FORMAT_TC_PARAMS(TextComparitor tc, String... compareStrings)
-
-
-
Field Detail
-
serialVersionUID
static final long serialVersionUID
This fulfils the SerialVersion UID requirement for all classes that implement Java's interface java.io.Serializable. Using the Serializable implementation offered by Java is very easy, and can make saving program state when debugging a lot easier. It can also be used in place of more complicated systems like "Hibernate" to store data as well.
Functional Interfaces are usually not thought of as Data Objects that need to be saved, stored and retrieved; however, having the ability to store intermediate results along with the lambda-functions that helped produce those results can make debugging easier.
- See Also:
- Constant Field Values
- Code:
- Exact Field Declaration Expression:
public static final long serialVersionUID = 1;
-
-
Method Detail
-
apply
java.util.Vector<HTMLNode> apply(java.net.URL url, java.util.Vector<HTMLNode> page) throws ArticleGetException
FunctionalInterface Target-Method:
This method corresponds to the @FunctionalInterface Annotation's method requirement. It is the only non-default, non-static method in this interface, and may be the target of a Lambda-Expression or '::' (double-colon) Function-Pointer.
This method's purpose is to take a "Scraped HTML Page" (stored as a Vectorized-HTML Web-Page), and return an HTML Vector that contains only the "Article Content" - which is usually just called the "Article Body." Perhaps it seems daunting, but the usual way to get the actual article-body of an HTML News-Website Page is to simply identify the surrounding HTML <DIV ID="..." CLASS="..."> element.
This class has several different static-methods called "usual" which automatically create a page-getter. The example at the top of this class should hi-lite how this works. Extracting news-content from a page that has already been downloaded is usually trivial. The point really becomes identifying the <DIV>'s class=... or id=... attributes & page-structure to find the article-body. Generally, in your browser, just click "View Source" and look at the HTML manually to find the attributes used. Using the myriad Get methods from Torello.HTML.NodeSearch usually boils down to code that looks suspiciously like JavaScript:
JavaScript:
var articleHTML = document.getElementById("article-body").innerHTML;

// or...
var articleHTML = document.getElementsByClassName("article-body")[0].innerHTML;
Using the NodeSearch package, the above DOM-Tree JavaScript is easily written in Java as below:
// For articles with HTML divider elements having an "ID" attribute to specify the article
// body, get the article using the code below.  In this example, the particular newspaper
// web-site has articles whose content ("Article Body") is simply wrapped in an HTML
// Divider Element:  <DIV ID="article-body"> ... </DIV>

// For extracting that content use the NodeSearch Package Class: InnerTagGetInclusive
Vector<HTMLNode> articleBody = InnerTagGetInclusive.first
    (page, "div", "id", TextComparitor.EQ_CI, "article-body");

// To use this NodeSearch Package Class with the NewsSite Package, simply use one of the
// 'usual' methods in class ArticleGet, and the lambda Functional Interface "ArticleGet"
// will be built automatically as such:
ArticleGet getter = ArticleGet.usual("div", "id", TextComparitor.EQ_CI, "article-body");

// For articles with HTML divider elements having a "CLASS" attribute to specify
// the article body, get the article with the following code.  Note that in this example
// the article body is wrapped in an HTML Divider Element that has the characteristics
// <DIV CLASS="article-body"> ... </DIV>.  The content of a Newspaper Article can be easily
// extracted with just one line of code using the methods in the NodeSearch Package as
// follows:
Vector<HTMLNode> articleBody2 = InnerTagGetInclusive.first
    (page, "div", "class", TextComparitor.C, "article-body");

// which should be written, for use with the ScrapeArticles class, using the 'usual'
// methods in ArticleGet as such:
ArticleGet getter2 = ArticleGet.usual(TextComparitor.EQ_CI, "article-body");
NOTE: For all examples above, the text-string "article-body" will be a tag-value that was decided/chosen by the HTML news-website or content-website you want to scrape.
ALSO: One might have to be careful about modifying the input to this functional-interface. Each and every one of the NodeSearch classes retrieves a copy (read: a clone) of the input Vector (other than the classes that actually use the term "remove"). However, if you were to write an ArticleGet lambda of your own (rather than using the "usual" methods), make sure you know whether you are going to intentionally modify the input-page, and if so, remember that you have.
FURTHERMORE: There are many content-based web-sites that have some (even "a lot") of spurious HTML information inside the primary article body, even after the header & footer information has been eliminated. It may be necessary to do some vector-cleaning later on. For example: getting rid of "Post to Facebook", "Post to Twitter" or "E-Mail Link" buttons.
- Throws:
ArticleGetException
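Hypothetical Usage Example:
Below is a brief sketch of invoking this target-method directly. The variables 'url' and 'page' are assumed to have been produced by an earlier download-and-vectorize step, and the CSS class-name "article-body" is an invented place-holder.

// 'url' and 'page' (the vectorized HTML retrieved from that URL) are assumed to exist
// already; the class-name "article-body" is a hypothetical place-holder for whatever the
// target site actually uses.
ArticleGet       getter  = ArticleGet.usual(TextComparitor.C, "article-body");
Vector<HTMLNode> article = getter.apply(url, page);  // throws ArticleGetException on failure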
-
usual
static ArticleGet usual(java.lang.String htmlTag)
This is a static, factory method for building ArticleGet.
This builds an "Article Getter" based on a parameter-specified HTML Tag. Two or three common HTML "semantic elements" used for wrapping newspaper article-content include these:
<ARTICLE ...> article-body </ARTICLE>
<MAIN ...> article-body </MAIN>
<SECTION ...> article-body </SECTION>
Identifying which tag to use can be accomplished by going to the main-page of an internet news web-site, selecting a news-article, and then using the "View Source" or the "View Page Source" menu-option (depending upon which browser you are using), and then scanning the HTML to find what elements are used to wrap the article-body.
Call this method, and use the ArticleGet that it generates/returns with the class NewsSiteScrape. As long as the news or content website that you are scraping has its page-body wrapped inside of the HTML element you have uncovered by inspecting the page manually, the ArticleGet produced by this factory-method will retrieve your page content appropriately.
- Parameters:
htmlTag - This should be the HTML element that is used to wrap the actual news-content article-body of an HTML news web-site page.
- Returns:
- This returns an "Article Getter" that just picks out the part of a news-website article that lies between the open and closed version of the specified htmlTag.
- Code:
- Exact Method Body:
final String htmlTagLC = htmlTag.toLowerCase();

// This 'final String' is merely used for proper error reporting in any potential
// exception-messages, nothing else.
final String functionNameStr = "TagNodeGetInclusive.first(page, \"" + htmlTagLC + "\");";

// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
// FAIL-FAST: Check user-input for possible errors BEFORE building the Lambda-Function.
// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***

// Check for valid HTML Token
HTMLTokException.check(htmlTagLC);

// Self-Closing / Singleton Tags CANNOT be used with INCLUSIVE Retrieval Operations.
InclusiveException.check(htmlTagLC);

// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
// Build the instance, using a lambda-expression
// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***

return (URL url, Vector<HTMLNode> page) ->
{
    // This exception-check is done on every invocation of this Lambda-Function.
    // It is merely checking that these inputs are not-null, and page is of non-zero size.
    // ArticleGetException is a compile-time, checked exception.  It is important to halt
    // News-Site Scrape Progress when "Empty News-Page Data" is being passed here.
    // NOTE: This would imply an internal-error with class Download has occurred.
    ArticleGetException.check(url, page);

    Vector<HTMLNode> ret;

    try
        { ret = TagNodeGetInclusive.first(page, htmlTagLC); }
    catch (Exception e)
    {
        throw new ArticleGetException
            (ArticleGetException.GOT_EXCEPTION, functionNameStr, e);
    }

    // These error-checks are used to deduce whether the "Article Get" was successful.
    // When this exception is thrown, it means that the user-specified means of "Retrieving
    // an Article Body" FAILED.  In this case, the "innerHTML" of the specified htmlTag was
    // not found, and produced a null news-article page, or an empty news-article page.
    if (ret == null) throw new ArticleGetException
        (ArticleGetException.RET_NULL, functionNameStr);

    if (ret.size() == 0) throw new ArticleGetException
        (ArticleGetException.RET_EMPTY_VECTOR, functionNameStr);

    return ret;
};
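Hypothetical Usage Example:
The choice of the <MAIN> element below is an assumption that would have to be confirmed against the target site's page-source.

// Build a getter for a site whose article-body is wrapped as:  <MAIN ...> ... </MAIN>
// (The tag-name is an assumption for illustration only.)
ArticleGet getter = ArticleGet.usual("main");

Note that passing a self-closing / singleton tag (such as "img") would fail the fail-fast checks above, since inclusive retrieval requires both an opening and a closing tag.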
-
usual
static ArticleGet usual(TextComparitor tc, java.lang.String... cssClassCompareStrings)
This is a static, factory method for building ArticleGet.
This builds an "Article Getter" for you, using the most common way to get an article - specifically via the HTML <DIV CLASS="..."> element and its CSS 'class' selector.
Call this method, and use the ArticleGet that it generates/returns with the class NewsSiteScrape. As long as the news or content website that you are scraping has its page-body wrapped inside of an HTML <DIV> element whose CSS 'class' specifier is one you have uncovered by inspecting the page manually, the ArticleGet produced by this factory-method will retrieve your page content appropriately.
- Parameters:
tc - This should be any of the pre-instantiated TextComparitor's. Again, a TextComparitor is just a String compare function like: equals, contains, StrCmpr.containsIgnoreCase(...), etc...
cssClassCompareStrings - These are the values to be used by the TextComparitor when comparing with the value of the CSS-Selector "Class" from the list of DIV elements on the page.
- Returns:
- This returns an "Article Getter" that just picks out the part of a news-website article that lies between the HTML-DIV Element nodes whose class is identified by the CSS (Cascading Style Sheets) 'class' identifier, and the TextComparitor parameter that you have chosen.
- Code:
- Exact Method Body:
// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
// FAIL-FAST: Check user-input for possible errors BEFORE building the Lambda-Function.
// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***

// Check for valid compareStrings
TCCompareStrException.check(cssClassCompareStrings);

if (tc == null) throw new NullPointerException
    ("Null has been passed to TextComparitor Parameter 'tc', but this is not allowed here.");

// This 'final' String is merely used for proper error reporting in any potential
// exception-messages, nothing else.
final String functionNameStr =
    "InnerTagGetInclusive.first(page, \"div\", \"class\", " +
    STR_FORMAT_TC_PARAMS(tc, cssClassCompareStrings) + ")";

// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
// Build the instance, using a lambda-expression
// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***

return (URL url, Vector<HTMLNode> page) ->
{
    // This exception-check is done on every invocation of this Lambda-Function.
    // It is merely checking that these inputs are not-null, and page is of non-zero size.
    // ArticleGetException is a compile-time, checked exception.  It is important to halt
    // News-Site Scrape Progress when "Empty News-Page Data" is being passed here.
    // NOTE: This would imply an internal-error with class Download has occurred.
    ArticleGetException.check(url, page);

    Vector<HTMLNode> ret;

    try
    {
        ret = InnerTagGetInclusive.first
            (page, "div", "class", tc, cssClassCompareStrings);
    }
    catch (Exception e)
    {
        throw new ArticleGetException
            (ArticleGetException.GOT_EXCEPTION, functionNameStr, e);
    }

    // These error-checks are used to deduce whether the "Article Get" was successful.
    // When this exception is thrown, it means that the user-specified means of "Retrieving
    // an Article Body" FAILED.  In this case, the "innerHTML" of the specified htmltag and
    // class of the <DIV CLASS=...> produced a null news-article page, or an empty
    // news-article page.
    if (ret == null) throw new ArticleGetException
        (ArticleGetException.RET_NULL, functionNameStr);

    if (ret.size() == 0) throw new ArticleGetException
        (ArticleGetException.RET_EMPTY_VECTOR, functionNameStr);

    return ret;
};
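Hypothetical Usage Example:
A minimal sketch; the two class-names below are invented examples of what page-inspection might uncover.

// Accept a <DIV> whose CSS 'class' attribute matches (via TextComparitor.C) either of
// the two hypothetical class-names below.
ArticleGet getter = ArticleGet.usual(TextComparitor.C, "article-body", "story-body");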
-
usual
static ArticleGet usual(java.lang.String htmlTag, java.lang.String innerTag, TextComparitor tc, java.lang.String... attributeValueCompareStrings)
This is a static, factory method for building ArticleGet.
This gives more options for building your article getter. In almost 95% of news-websites, the article or page-body is between an open and close HTML DIV element, and the <DIV CLASS="..."> can be found by the CSS 'class' attribute. However, this factory method allows a programmer to select article content that handles cases other than the 95%, where you specify the HTML-token, attribute-name and use the usual TextComparitor to find the article.
- Parameters:
htmlTag - This is almost always a "DIV" element, but if you wish to specify something else, possibly a paragraph element (<P>), or maybe an <IFRAME> or <FRAME>, then you may.
innerTag - This is almost always a "CLASS" attribute, but if you need to use "ID" or something different altogether - possibly a site-specific tag - then use the innerTag / attribute-name of your choice.
tc - This should be any of the pre-instantiated TextComparitor's. Again, a TextComparitor is just a String compare function like: equals, contains, StrCmpr.containsIgnoreCase(...).
attributeValueCompareStrings - These are the String's that the innerTag value is compared against, using the TextComparitor.
- Returns:
- This returns an "Article Getter" that picks out the part of a news-website article that lies between the HTML element which matches the 'htmlTag' and 'innerTag' (id, class, or "other"), and whose attribute-value for the specified inner-tag can be matched by the TextComparitor and the compare-String's.
- Code:
- Exact Method Body:
// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
// FAIL-FAST: Check user-input for possible errors BEFORE building the Lambda-Function.
// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***

TCCompareStrException.check(attributeValueCompareStrings);

if (tc == null) throw new NullPointerException
    ("Null has been passed to TextComparitor Parameter 'tc', but this is not allowed here.");

final String htmlTagLC  = htmlTag.toLowerCase();
final String innerTagLC = innerTag.toLowerCase();

// This 'final String' is merely used for proper error reporting in any potential
// exception-messages, nothing else.
final String functionNameStr =
    "InnerTagGetInclusive.first(page, \"" + htmlTag + "\", \"" + innerTag + "\", " +
    STR_FORMAT_TC_PARAMS(tc, attributeValueCompareStrings) + ")";

// Check for valid HTML Tag.
HTMLTokException.check(htmlTagLC);

// Self-Closing / Singleton Tags CANNOT be used with INCLUSIVE Retrieval Operations.
InclusiveException.check(htmlTagLC);

// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
// Build the instance, using a lambda-expression
// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***

return (URL url, Vector<HTMLNode> page) ->
{
    // This exception-check is done on every invocation of this Lambda-Function.
    // It is merely checking that these inputs are not-null, and page is of non-zero size.
    // ArticleGetException is a compile-time, checked exception.  It is important to halt
    // News-Site Scrape Progress when "Empty News-Page Data" is being passed here.
    // NOTE: This would imply an internal-error with class Download has occurred.
    ArticleGetException.check(url, page);

    Vector<HTMLNode> ret;

    try
    {
        ret = InnerTagGetInclusive.first
            (page, htmlTagLC, innerTagLC, tc, attributeValueCompareStrings);
    }
    catch (Exception e) // unlikely
    {
        throw new ArticleGetException
            (ArticleGetException.GOT_EXCEPTION, functionNameStr, e);
    }

    // These error-checks are used to deduce whether the "Article Get" was successful.
    // When this exception is thrown, it means that the user-specified means of "Retrieving
    // an Article Body" FAILED.  In this case, the "innerHTML" of the specified htmlTag and
    // attribute produced a null news-article page, or an empty news-article page.
    if (ret == null) throw new ArticleGetException
        (ArticleGetException.RET_NULL, functionNameStr);

    if (ret.size() == 0) throw new ArticleGetException
        (ArticleGetException.RET_EMPTY_VECTOR, functionNameStr);

    return ret;
};
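Hypothetical Usage Example:
A sketch for sites that label the article wrapper with an 'id' rather than a 'class'; the element-name and id-value below are assumptions.

// Retrieve everything inside:  <SECTION ID="story"> ... </SECTION>
// (Both "section" and "story" are invented for this illustration.)
ArticleGet getter = ArticleGet.usual("section", "id", TextComparitor.EQ_CI, "story");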
-
usual
static ArticleGet usual(java.lang.String htmlTag, java.lang.String innerTag, java.util.regex.Pattern innerTagValuePattern)
This is a static, factory method for building ArticleGet.
This gives more options for building your article getter. In almost 95% of news-websites, the article or page-body is between an open and close HTML DIV element, and the <DIV CLASS="..."> can be found by the CSS 'class' attribute. However, this factory method allows a programmer to select article content that handles cases other than the 95%. Here, you may specify the HTML-token, attribute-name and use a Java Regular-Expression handler to test the value of the attribute - no matter how complicated or bizarre.
- Parameters:
htmlTag - This is almost always a "DIV" element, but if you wish to specify something else, possibly a paragraph element (<P>), or maybe an <IFRAME> or <FRAME>, then you may.
innerTag - This is almost always a "CLASS" attribute, but if you need to use "ID" or something different altogether - possibly a site-specific tag - then use the innerTag / attribute-name of your choice.
innerTagValuePattern - Any regular-expression. It will be used to PASS or FAIL the attribute-value (a name that is used interchangeably in this scrape/search package with "inner-tag-value") when compared against this regular-expression parameter.
HELP: This would be like saying:
// Pick some random HTML TagNode
TagNode aTagNode = (TagNode) page.elementAt(index_to_test);

// Gets the attribute value of "innerTag"
String attributeValue = aTagNode.AV(innerTag);

// Make sure the HTML-token is as specified
// calls to: java.util.regex.*;
boolean passFail =
    aTagNode.tok.equals(htmlTag) &&
    innerTagValuePattern.matcher(attributeValue).find();
- Returns:
- This returns an "Article Getter" that picks out the part of a news-website article that lies between the HTML element which matches the htmlTag, innerTag and value-testing regex Pattern "innerTagValuePattern".
- Code:
- Exact Method Body:
final String htmlTagLC  = htmlTag.toLowerCase();
final String innerTagLC = innerTag.toLowerCase();

// This 'final String' is merely used for proper error reporting in any potential
// exception-messages, nothing else.
final String functionNameStr =
    "InnerTagGetInclusive.first(page, \"" + htmlTag + "\", \"" + innerTag + "\", " +
    innerTagValuePattern.pattern() + ")";

// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
// FAIL-FAST: Check user-input for possible errors BEFORE building the Lambda-Function.
// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***

HTMLTokException.check(htmlTagLC);
InclusiveException.check(htmlTagLC);

// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
// Build the instance, using a lambda-expression
// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***

return (URL url, Vector<HTMLNode> page) ->
{
    // This exception-check is done on every invocation of this Lambda-Function.
    // It is merely checking that these inputs are not-null, and page is of non-zero size.
    // ArticleGetException is a compile-time, checked exception.  It is important to halt
    // News-Site Scrape Progress when "Empty News-Page Data" is being passed here.
    // NOTE: This would imply an internal-error with class Download has occurred.
    ArticleGetException.check(url, page);

    Vector<HTMLNode> ret;

    try
    {
        ret = InnerTagGetInclusive.first
            (page, htmlTagLC, innerTagLC, innerTagValuePattern);
    }
    catch (Exception e) // unlikely
    {
        throw new ArticleGetException
            (ArticleGetException.GOT_EXCEPTION, functionNameStr, e);
    }

    // These error-checks are used to deduce whether the "Article Get" was successful.
    // When this exception is thrown, it means that the user-specified means of "Retrieving
    // an Article Body" FAILED.  In this case, the "innerHTML" of the specified htmlTag and
    // attribute produced a null news-article page, or an empty news-article page.
    if (ret == null) throw new ArticleGetException
        (ArticleGetException.RET_NULL, functionNameStr);

    if (ret.size() == 0) throw new ArticleGetException
        (ArticleGetException.RET_EMPTY_VECTOR, functionNameStr);

    return ret;
};
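Hypothetical Usage Example:
The regular-expression below is invented to illustrate matching a family of class-names such as "article-body-1", "article-body-2", and so on.

// Match any <DIV> whose 'class' attribute-value looks like "article-body-<digits>"
Pattern p = Pattern.compile("article-body-\\d+");     // invented pattern
ArticleGet getter = ArticleGet.usual("div", "class", p);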
-
usual
static ArticleGet usual(java.lang.String htmlTag, java.lang.String innerTag, java.util.function.Predicate<java.lang.String> p)
This is a static, factory method for building ArticleGet.
This gives more options for building your article getter. In almost 95% of news-websites, the article or page-body is between an open and close HTML 'DIV' element, and the <DIV CLASS="..."> can be found by the CSS 'class' attribute. However, this factory method allows a programmer to select article content that handles cases other than the 95%, where you specify the HTML-token, attribute-name and a Predicate<String> for finding the page-body.
- Parameters:
htmlTag - This is almost always a "DIV" element, but if you wish to specify something else, possibly a paragraph element (<P>), or maybe an <IFRAME> or <FRAME>, then you may.
innerTag - This is almost always a "CLASS" attribute, but if you need to use "ID" or something different altogether - possibly a site-specific tag - then use the innerTag / attribute-name of your choice.
p - This java lambda "Predicate" will just receive the attribute-value from the "inner-tag" and provide a yes/no answer.
- Returns:
- This returns an "Article Getter" that matches an HTML element specified by 'htmlTag', 'innerTag' and the result of the String-Predicate parameter 'p' on the value of that inner-tag.
- Code:
- Exact Method Body:
final String htmlTagLC  = htmlTag.toLowerCase();
final String innerTagLC = innerTag.toLowerCase();

// This 'final' String is merely used for proper error reporting in any potential
// exception-messages, nothing else.
final String functionNameStr =
    "InnerTagGetInclusive.first(page, \"" + htmlTag + "\", \"" + innerTag + "\", " +
    "Predicate<String>)";

// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
// FAIL-FAST: Check user-input for possible errors BEFORE building the Lambda-Function.
// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***

HTMLTokException.check(htmlTagLC);
InclusiveException.check(htmlTagLC);

if (p == null) throw new NullPointerException
    ("Null has been passed to Predicate parameter 'p'. This is not allowed here.");

// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
// Build the instance, using a lambda-expression
// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***

return (URL url, Vector<HTMLNode> page) ->
{
    // This exception-check is done on every invocation of this Lambda-Function.
    // It is merely checking that these inputs are not-null, and page is of non-zero size.
    // ArticleGetException is a compile-time, checked exception.  It is important to halt
    // News-Site Scrape Progress when "Empty News-Page Data" is being passed here.
    // NOTE: This would imply an internal-error with class Download has occurred.
    ArticleGetException.check(url, page);

    Vector<HTMLNode> ret;

    try
        { ret = InnerTagGetInclusive.first(page, htmlTagLC, innerTagLC, p); }
    catch (Exception e)
    {
        throw new ArticleGetException
            (ArticleGetException.GOT_EXCEPTION, functionNameStr, e);
    }

    // These error-checks are used to deduce whether the "Article Get" was successful.
    // When this exception is thrown, it means that the user-specified means of "Retrieving
    // an Article Body" FAILED.  In this case, the "innerHTML" of the specified htmlTag and
    // attribute produced a null news-article page, or an empty news-article page.
    if (ret == null) throw new ArticleGetException
        (ArticleGetException.RET_NULL, functionNameStr, null);

    if (ret.size() == 0) throw new ArticleGetException
        (ArticleGetException.RET_EMPTY_VECTOR, functionNameStr, null);

    return ret;
};
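Hypothetical Usage Example:
A sketch using a String-Predicate; the class-name prefix tested below is invented.

// Accept the <DIV> whose 'class' value starts with "post-content", regardless of any
// additional class-names appended after it.  (The prefix is a place-holder.)
ArticleGet getter = ArticleGet.usual
    ("div", "class", (String classValue) ->
        (classValue != null) && classValue.startsWith("post-content"));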
-
usual
static ArticleGet usual(java.lang.String startTextTag, java.lang.String endTextTag)
This is a static, factory method for building ArticleGet.
This factory method generates an "ArticleGet" that will retrieve news-article body-content based on a "start-tag" and an "end-tag." It is very important to note that the text can only match a single text-node, and not span multiple text-nodes, or be within TagNode's at all! This should be easy to find: print up the HTML page as a Vector, and inspect it!
- Parameters:
startTextTag - This must be text from an HTML TextNode that is contained within one (single) TextNode of the vectorized-HTML page.
endTextTag - This must be text from an HTML TextNode that is also contained in a single TextNode of the vectorized-HTML page.
- Returns:
- This will return an "Article Getter" that looks for non-HTML Text in the article, specified by the text-tag parameters, and gets it.
- Code:
- Exact Method Body:
// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
// FAIL-FAST: Check user-input for possible errors BEFORE building the Lambda-Function.
// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***

if (startTextTag == null) throw new NullPointerException
    ("Null has been passed to parameter 'startTextTag', but this is not allowed here.");

if (endTextTag == null) throw new NullPointerException
    ("Null has been passed to parameter 'endTextTag', but this is not allowed here.");

// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
// Build the instance, using a lambda-expression
// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***

return (URL url, Vector<HTMLNode> page) ->
{
    // This exception-check is done on every invocation of this Lambda-Function.
    // It is merely checking that these inputs are not-null, and page is of non-zero size.
    // ArticleGetException is a compile-time, checked exception.  It is important to halt
    // News-Site Scrape Progress when "Empty News-Page Data" is being passed here.
    // NOTE: This would imply an internal-error with class Download has occurred.
    ArticleGetException.check(url, page);

    int start = -1;
    int end   = -1;
    HTMLNode n = null;

    while (start++ < page.size())
        if ((n = page.elementAt(start)) instanceof TextNode)
            if (n.str.contains(startTextTag))
                break;

    while (end++ < page.size())
        if ((n = page.elementAt(end)) instanceof TextNode)
            if (n.str.contains(endTextTag))
                break;

    // These error-checks are used to deduce whether the "Article Get" was successful.
    // When this exception is thrown, it means that the user-specified means of "Retrieving
    // an Article Body" FAILED.  In this case it is because the start/end tags were not found
    // in the text of the vectorized-html news-article web-page.
    if (start == page.size()) throw new ArticleGetException(
        "Start Text Tag [" + startTextTag + "], was not found on the News Article HTML " +
        "page."
    );

    if (end == page.size()) throw new ArticleGetException(
        "End Text Tag [" + endTextTag + "], was not found on the News Article HTML " +
        "page."
    );

    return Util.cloneRange(page, start, end + 1);
};
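Hypothetical Usage Example:
The two text-snippets below are assumed to appear (each inside one single TextNode) on every article-page of the imaginary target site.

// Keep everything between the TextNode containing "Story Highlights" and the TextNode
// containing "Related Coverage".  (Both strings are invented place-holders.)
ArticleGet getter = ArticleGet.usual("Story Highlights", "Related Coverage");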
-
usual
static ArticleGet usual(java.util.regex.Pattern startPattern, java.util.regex.Pattern endPattern)
This is a static, factory method for building ArticleGet.
This factory method generates an "ArticleGet" that will retrieve news-article body-content based on starting and ending regular-expressions. The matches performed by the Regular Expression checker will be performed on TextNode's, not on the TagNode's, or the page itself. It is very important to note that the text can only match a single TextNode, and not span multiple TextNode's, or be within TagNode's at all! This should be easy to find: print up the HTML page as a Vector, and inspect it!
- Parameters:
startPattern - This must be a regular expression Pattern that matches an HTML TextNode that is contained within one (single) TextNode of the vectorized-HTML page.
endPattern - This must be a regular expression Pattern that matches an HTML TextNode that is also contained in a single TextNode of the vectorized-HTML page.
- Returns:
- This will return an "Article Getter" that looks for non-HTML Text in the article, specified by the regular-expression pattern-matching parameters, and gets it.
- Code:
- Exact Method Body:
// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
// FAIL-FAST: Check user-input for possible errors BEFORE building the Lambda-Function.
// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***

if (startPattern == null) throw new NullPointerException
    ("Null has been passed to parameter 'startPattern', but this is not allowed here.");

if (endPattern == null) throw new NullPointerException
    ("Null has been passed to parameter 'endPattern', but this is not allowed here.");

// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
// Build the instance, using a lambda-expression
// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***

return (URL url, Vector<HTMLNode> page) ->
{
    // This exception-check is done on every invocation of this Lambda-Function.
    // It is merely checking that these inputs are not-null, and page is of non-zero size.
    // ArticleGetException is a compile-time, checked exception.  It is important to halt
    // News-Site Scrape Progress when "Empty News-Page Data" is being passed here.
    // NOTE: This would imply an internal-error with class Download has occurred.
    ArticleGetException.check(url, page);

    int start = -1;
    int end   = -1;
    HTMLNode n = null;

    while (start++ < page.size())
        if ((n = page.elementAt(start)) instanceof TextNode)
            if (startPattern.matcher(n.str).find())
                break;

    while (end++ < page.size())
        if ((n = page.elementAt(end)) instanceof TextNode)
            if (endPattern.matcher(n.str).find())
                break;

    // These error-checks are used to deduce whether the "Article Get" was successful.
    // When this exception is thrown, it means that the user-specified means of "Retrieving
    // an Article Body" FAILED.  In this case it is because the start or end regex failed to
    // match.
    if (start == page.size()) throw new ArticleGetException(
        "Start Pattern [" + startPattern.toString() + "], was not found on the HTML " +
        "page."
    );

    if (end == page.size()) throw new ArticleGetException
        ("End Pattern [" + endPattern.toString() + "], was not found on the HTML page.");

    return Util.cloneRange(page, start, end + 1);
};
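Hypothetical Usage Example:
Both regular-expressions below are invented place-holders for text that would have to be discovered by inspecting the site's pages.

// Start at a TextNode matching a by-line (e.g. "By Jane Doe"), and stop at a TextNode
// matching a copyright footer (e.g. "Copyright 2019").  Both patterns are invented.
ArticleGet getter = ArticleGet.usual(
    Pattern.compile("By\\s+[A-Z][a-z]+\\s+[A-Z][a-z]+"),
    Pattern.compile("Copyright\\s+\\d{4}")
);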
-
branch
static ArticleGet branch(URLFilter[] urlSelectors, ArticleGet[] getters)
This is a static, factory method for building ArticleGet.
This is just a way to put a list of article-parse objects into a single "branching" article-parse Object. The two parameters must be equal-length arrays, with non-null elements. Each 'urlSelector' will be tested, and when a selector passes, the ArticleGet that is created will use the "parallel getter" from the parallel array "getters."
LAY-SPEAK: The best way to summarize this is that if a programmer is going to use the NewsSiteScrape class, and is planning to scrape a site that has different types of news-articles, he will need differing "ArticleGet" methods. This method will take two arrays that match the URL from which the article was retrieved with the particular "getter" method you have provided. When I scrape the address: http://www.baidu.com/ - a Chinese News Web-Site - it links to at least three primary domains:
http://...chinesenews.com/director.../article...
http://...xinhuanet.com/director.../article...
http://...cctv.com/director.../article...
Results from each of these sites need to be "handled" just ever-so-slightly differently.
- Parameters:
urlSelectors - This is a list of Predicate<URL> elements. When one of these returns TRUE for a particular URL, then the index of that URL-selector in its array will be used to call the appropriate getter from the parallel-array input-parameter 'getters'.
getters - This is a list of getter elements. These should be tailored to the particular news-website sources that are chosen/selected by the 'urlSelectors' parallel array.
- Returns:
- This will be a "master ArticleGet" or a "dispatch ArticleGet." All it does is simply traverse the first array looking for a Predicate-match from the 'urlSelectors', and then calls the getter in the parallel array.
NOTE: If none of the 'urlSelectors' match when this "dispatch" (or rather "branch") is called by class NewsSiteScrape, the function/getter that is returned will throw an ArticleGetException. It is important that the programmer only allow article URL's that he can capably handle to pass to class NewsSiteScrape.
- Throws:
java.lang.IllegalArgumentException
- Will throw this exception if:
- Either of these parameters is null
- If they are not parallel, having differing lengths.
- If either contains a null value.
- Code:
- Exact Method Body:
// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
// FAIL-FAST: Check user-input for possible errors BEFORE building the Lambda-Function.
// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***

if (urlSelectors.length == 0) throw new IllegalArgumentException
    ("parameter 'urlSelectors' had zero-elements.");

if (getters.length == 0) throw new IllegalArgumentException
    ("parameter 'getters' had zero-elements.");

ParallelArrayException.check(urlSelectors, "urlSelectors", true, getters, "getters", true);

// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
// Build the instance, using a lambda-expression
// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***

return (URL url, Vector<HTMLNode> page) ->
{
    for (int i=0; i < urlSelectors.length; i++)
        if (urlSelectors[i].test(url))
            return getters[i].apply(url, page);

    throw new ArticleGetException(
        "None of the urlSelecctors you have provided matched the URL sent to this " +
        "instance of ArticleGet."
    );
};
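Hypothetical Usage Example:
A hedged sketch of a two-domain dispatch; the URLFilter instances are assumed to have been built elsewhere (their construction is not covered on this page), and the class-name "article-body" is an invented example.

// urlSelectorA and urlSelectorB are hypothetical, previously-built URLFilter instances,
// one per news-domain that the scrape is expected to encounter.
URLFilter[]  urlSelectors = { urlSelectorA, urlSelectorB };

ArticleGet[] getters = {
    ArticleGet.usual("div", "class", TextComparitor.C, "article-body"),  // domain A
    ArticleGet.usual("article")                                          // domain B
};

// The "dispatch" getter tests each selector in order and runs the parallel getter.
ArticleGet dispatch = ArticleGet.branch(urlSelectors, getters);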
-
andThen
default ArticleGet andThen(ArticleGet after)
This is the standard-java Function 'andThen' method.
- Parameters:
after - This is the ArticleGet that will be (automatically) applied after 'this' function.
- Returns:
- A new, composite ArticleGet that performs both operations. It will:
- Run 'this' function's 'apply' method on a URL, Vector<HTMLNode>, and return a Vector<HTMLNode>.
- Then it will run the 'after' function's 'apply' method on the results of 'this.apply(...)' and return the result.
- Code:
- Exact Method Body:
return (URL url, Vector<HTMLNode> page) -> after.apply(url, this.apply(url, page));
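Hypothetical Usage Example:
A sketch of a composition; the inner class-name "story-text" is invented for the example.

// First isolate the <ARTICLE> element, then narrow that result further to an inner
// <DIV CLASS="story-text"> ... </DIV> wrapper.  (The class-name is a place-holder.)
ArticleGet outer  = ArticleGet.usual("article");
ArticleGet inner  = ArticleGet.usual("div", "class", TextComparitor.C, "story-text");
ArticleGet getter = outer.andThen(inner);   // runs 'outer' first, then 'inner' on its result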
-
compose
default ArticleGet compose(ArticleGet before)
This is the standard-java Function 'compose' method.
- Parameters:
before - This is the ArticleGet that is performed first, whose results are sent to 'this' function.
- Returns:
- A new composite ArticleGet that performs both operations. It will:
- Run the 'before' function's 'apply' method on a URL, Vector<HTMLNode>, and return a Vector<HTMLNode>.
- Then it will run 'this' function's 'apply' method on the results of the before.apply(...) and return the result.
- Code:
- Exact Method Body:
return (URL url, Vector<HTMLNode> page) -> this.apply(url, before.apply(url, page));
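Hypothetical Usage Example:
The mirror-image of the 'andThen' sketch above, using the same two hypothetical getters.

// inner.compose(outer) applies 'outer' first, then 'inner' - the same pipeline as
// outer.andThen(inner).  Both class-names below remain invented place-holders.
ArticleGet outer   = ArticleGet.usual("article");
ArticleGet inner   = ArticleGet.usual("div", "class", TextComparitor.C, "story-text");
ArticleGet getter2 = inner.compose(outer);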
-
identity
static ArticleGet identity()
The identity function will always return the same Vector<HTMLNode> as output that it receives as input. This parallels Java's standard 'identity' lambda-methods.
- Returns:
- A new ArticleGet which (it should be obvious) behaves like a java.util.function.Function<Vector<HTMLNode>, Vector<HTMLNode>> ... where the returned Vector is always the same as (identical to) the input Vector.
- Code:
- Exact Method Body:
return (URL url, Vector<HTMLNode> page) -> { ArticleGetException.check(url, page); return page; };
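Hypothetical Usage Example:
When combined with branch(...), identity() can serve as the getter for a domain whose pages are assumed to already contain only article content.

// Pages from this (hypothetical) domain need no trimming, so the identity getter
// simply returns the page it was handed (after the usual null / empty-page check).
ArticleGet noOp = ArticleGet.identity();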
-
STR_FORMAT_TC_PARAMS
static java.lang.String STR_FORMAT_TC_PARAMS (TextComparitor tc, java.lang.String... compareStrings)
Internally Used.
-
-