Interface ArticleGet

  • All Superinterfaces:
    java.io.Serializable
    Functional Interface:
    This is a functional interface and can therefore be used as the assignment target for a lambda expression or method reference.

    @FunctionalInterface
    public interface ArticleGet
    extends java.io.Serializable
    A function-pointer / lambda target for extracting an article's content from the web-page whence it was downloaded; including several static-builder methods for the most common means of finding the HTML-Tags that wrap article-HTML on news-media websites.

    The purpose of this Functional Interface is to "get" an article-content-body out of the complete-HTML page in which it resides. For all intents and purposes, this is a trivial coding problem, but one that differs for just about every news-site on the internet.

    Generally, all that is needed to implement an instance of class ArticleGet is to provide the needed parameters to one of several factory-methods in this class. This class contains several 'static' methods named 'usual(...)' that accept typical NodeSearch-Package parameters for extracting a partial Web-Page out of a complete one.

    Primarily, the use of class 'ArticleGet' is such that after a list of News-Article URL's has been built from an online News-Based Web-Site, those Articles may be processed quickly. This is accomplished by immediately removing all extraneous HTML and concentrating only on the Article and Header themselves.

    For example, on Yahoo! News, downloading any one of the myriad Yahoo! Articles, one will encounter lists upon lists of "related news", advertisements, links to other sections of the site, and even User-Comments. The Article-Body itself - usually including the Title, Author and Story-Photos - is easily retrieved by looking for the HTML Tag "<ARTICLE ...>".

    To retrieve the contents of the <ARTICLE> ... </ARTICLE> construct, simply make a call to the NodeSearch-Package method TagNodeGetInclusive.first(fullPage, "article"). It will retrieve the entire Article Content in a single Line of Code!

    Below are some examples of how to build an instance of 'ArticleGet' such that it may be passed to the method ScrapeArticles.download(...). Generally, this automates the sometimes laborious process of scraping an entire News Web-Site for a day's entire set of articles.

    Example:
    // This example would take a page copied from a URL on a news-site, and eliminate everything except the
    // HTMLNode's that are between the DIV whose class attribute is:
    // <DIV ... class="body-content"> article ... [HTMLNodes] ... </DIV>
    
    // This uses java's lambda syntax to build the ArticleGet instance
    ArticleGet ag = (URL url, Vector<HTMLNode> page) ->
        InnerTagGetInclusive.first
            (page, "div", "class", TextComparitor.C, "body-content");
    
    // The behaviour of this ArticleGetter will be identical to the one manually built above.
    // Here, a pre-defined "factory builder" method is used instead:
    
    ArticleGet ag2 = ArticleGet.usual("div", "class", TextComparitor.C, "body-content");
    


    • Field Detail

      • serialVersionUID

        static final long serialVersionUID
        This fulfills the serialVersionUID requirement for all classes that implement Java's interface java.io.Serializable. Using the Serializable implementation offered by Java is very easy, and can make saving program state when debugging a lot easier. It can also be used in place of more complicated systems like "hibernate" to store data.

        Functional Interfaces are usually not thought of as Data Objects that need to be saved, stored and retrieved; however, having the ability to store intermediate results along with the lambda-functions that helped get those results can make debugging easier.
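
        The point above can be sketched with plain JDK serialization. The interface and class names below are hypothetical, chosen only to mirror the pattern of a functional interface that extends java.io.Serializable; they are not part of this package:

        ```java
        import java.io.*;

        // Hypothetical interface mirroring ArticleGet's pattern: a functional
        // interface that extends Serializable, so its lambdas can be serialized.
        interface StringGetter extends Serializable {
            String get(String input);
        }

        public class SerializableLambdaDemo {
            // Writes the lambda to a byte-array and reads it back.
            static StringGetter roundTrip(StringGetter g) {
                try {
                    ByteArrayOutputStream bos = new ByteArrayOutputStream();
                    try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                        oos.writeObject(g);
                    }
                    try (ObjectInputStream ois =
                             new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
                        return (StringGetter) ois.readObject();
                    }
                } catch (IOException | ClassNotFoundException e) {
                    throw new RuntimeException(e);
                }
            }

            public static void main(String[] args) {
                StringGetter upper = s -> s.toUpperCase();
                StringGetter restored = roundTrip(upper);
                System.out.println(restored.get("article"));  // prints "ARTICLE"
            }
        }
        ```

        Because the target type (here, StringGetter) extends Serializable, the compiler generates a serializable lambda automatically; no extra code is needed.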
        See Also:
        Constant Field Values
        Code:
        Exact Field Declaration Expression:
         public static final long serialVersionUID = 1;
        
    • Method Detail

      • apply

        java.util.Vector<HTMLNode> apply​(java.net.URL url,
                                         java.util.Vector<HTMLNode> page)
                                  throws ArticleGetException
        FunctionalInterface Target-Method:
        This method corresponds to the @FunctionalInterface Annotation's method requirement. It is the only non-default, non-static method in this interface, and may be the target of a Lambda-Expression or '::' (double-colon) Function-Pointer.

        This method's purpose is to take a "Scraped HTML Page" (stored as a Vectorized-HTML Web-Page), and return an HTML Vector that contains only the "Article Content" - which is usually just called the "Article Body." Perhaps it seems daunting, but the usual way to get the actual article-body of an HTML News-Website Page is to simply identify an HTML <DIV ID="..." CLASS="..."> surrounding element.

        This class has several different static-methods called "usual" which automatically create a page-getter. The example at the top of this class should highlight how this works. Extracting news-content from a page that has already been downloaded is usually trivial; the real task is identifying the <DIV>'s class=... or id=... attributes & page-structure to find the article-body. Generally, just click "View Source" in your browser and inspect the HTML manually to find the attributes used. Using the myriad Get methods from Torello.HTML.NodeSearch usually boils down to code that looks remarkably similar to JavaScript:


        JavaScript:
          var articleHTML = document.getElementById("article-body").innerHTML;
        
          // or...
          var articleHTML = document.getElementByClassName("article-body").innerHTML;
        

        Using the NodeSearch package, the above DOM-Tree Java-Script is easily written in Java as below:
          // For articles with HTML divider elements having an "ID" attribute to specify the article
          // body, get the article using the code below.  In this example, the particular newspaper
          // web-site has articles whose content ("Article Body") is simply wrapped in an HTML
          // HTML Divider Element: <DIV ID="article-body"> ... </DIV>
         
          // For extracting that content use the NodeSearch Package Class: InnerTagGetInclusive
        
          Vector<HTMLNode> articleBody = InnerTagGetInclusive.first
              (page, "div", "id", TextComparitor.EQ_CI, "article-body");
        
          // To use this NodeSearch Package Class with the NewsSite Package, simply use one of the
          // 'usual' methods in class ArticleGet, and the lambda Functional Interface "ArticleGet"
          // will be built automatically as such:
        
          ArticleGet getter = ArticleGet.usual("div", "id", TextComparitor.EQ_CI, "article-body");
        
          // For articles with HTML divider elements having a "CLASS" attribute to specify
          // the article body, get the article with the following code.  Note that in this example
          // the article body is wrapped in an HTML Divider Element that has the characteristics
          // <DIV CLASS="article-body"> ... </DIV>.  The content of a Newspaper Article can be easily
          // extracted with just one line of code using the methods in the NodeSearch Package as
          // follows: 
        
          Vector<HTMLNode> articleBody = InnerTagGetInclusive.first
              (page, "div", "class", TextComparitor.C, "article-body");
        
          // which should be written for use with the ScrapeArticles class using the 'usual'
          // methods in ArticleGet as such:
        
          ArticleGet getter = ArticleGet.usual(TextComparitor.C, "article-body");
        


        NOTE: For all examples above, the text-string "article-body" is an attribute-value that was decided/chosen by the HTML news-website, or content-website, you want to scrape.

        ALSO: One might have to be careful about modifying the input to this functional interface. Each and every one of the NodeSearch classes retrieves a copy (read: a clone) of the input Vector (other than the classes that actually use the term "remove"). However, if you write an ArticleGet lambda of your own (rather than using the "usual" methods), make sure you know whether you intend to modify the input-page, and if so, remember that you have.

        FURTHERMORE: There are many content-based web-sites that have some (even "a lot") of spurious HTML information inside the primary article body, even after the header & footer information has been eliminated. It may be necessary to do some vector-cleaning later on. For example: getting rid of "Post to Facebook", "Post to Twitter" or "E-Mail Link" buttons.
        Throws:
        ArticleGetException
      • usual

        static ArticleGet usual​(java.lang.String htmlTag)
        This is a static, factory method for building ArticleGet.

        This builds an "Article Getter" based on a parameter-specified HTML Tag. Three common HTML "semantic elements" used for wrapping newspaper article-content include these:

        • <ARTICLE ...> article-body </ARTICLE>
        • <MAIN ...> article-body </MAIN>
        • <SECTION ...> article-body </SECTION>

        Identifying which tag to use can be accomplished by going to the main-page of an internet news web-site, selecting a news-article, and then using "View Source" or "View Page Source" (depending upon which browser you are using), and then scanning the HTML to find what elements are used to wrap the article-body.

        Call this method, and use the ArticleGet that it generates/returns with the class NewsSiteScrape. As long as the news or content website that you are scraping has its article-body wrapped inside of the parameter-specified HTML element, the ArticleGet produced by this factory-method will retrieve your page content appropriately.
        Parameters:
        htmlTag - This should be the HTML element that is used to wrap the actual news-content article-body of an HTML news web-site page.
        Returns:
        This returns an "Article Getter" that just picks out the part of a news-website article that lies between the open and closed version of the specified htmlTag.
        Code:
        Exact Method Body:
         final String htmlTagLC = htmlTag.toLowerCase();
        
         // This 'final String' is merely used for proper error reporting in any potential
         // exception-messages, nothing else.
         final String functionNameStr = "TagNodeGetInclusive.first(page, \"" + htmlTagLC + "\");";
        
        
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
         // FAIL-FAST: Check user-input for possible errors BEFORE building the Lambda-Function.
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
        
         // Check for valid HTML Token
         HTMLTokException.check(htmlTagLC);
        
         // Self-Closing / Singleton Tags CANNOT be used with INCLUSIVE Retrieval Operations.
         InclusiveException.check(htmlTagLC);
        
        
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
         // Build the instance, using a lambda-expression
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
        
         return (URL url, Vector<HTMLNode> page) ->
         {
             // This exception-check is done on every invocation of this Lambda-Function.
             // It is merely checking that these inputs are not-null, and page is of non-zero size.
             // ArticleGetException is a compile-time, checked exception.  It is important to halt
             // News-Site Scrape Progress when "Empty News-Page Data" is being passed here.
             // NOTE: This would imply an internal-error with class Download has occurred.
        
             ArticleGetException.check(url, page);   
        
             Vector<HTMLNode> ret;
        
             try
                 { ret = TagNodeGetInclusive.first(page, htmlTagLC); }
        
             catch (Exception e)
             {
                 throw new ArticleGetException
                     (ArticleGetException.GOT_EXCEPTION, functionNameStr, e);
             }
        
             // These error-checks are used to deduce whether the "Article Get" was successful.
             // When this exception is thrown, it means that the user-specified means of "Retrieving
             // an Article Body" FAILED.  In this case, the "innerHTML" of the specified htmlTag was
             // not found, and produced a null news-article page, or an empty news-article page.
        
             if (ret == null) throw new ArticleGetException
                 (ArticleGetException.RET_NULL, functionNameStr);
        
             if (ret.size() == 0) throw new ArticleGetException
                 (ArticleGetException.RET_EMPTY_VECTOR, functionNameStr);
        
             return ret;
         };
        
      • usual

        static ArticleGet usual​(TextComparitor tc,
                                java.lang.String... cssClassCompareStrings)
        This is a static, factory method for building ArticleGet.

        This builds an "Article Getter" for you, using the most common way to get an article - specifically via the HTML <DIV CLASS="..."> element and its CSS 'class' selector.

        Call this method, and use the ArticleGet that it generates/returns with the class NewsSiteScrape. As long as the news or content website that you are scraping has its page-body wrapped inside of an HTML <DIV> element whose CSS 'class' specifier is one you have uncovered by inspecting the page manually, then the ArticleGet produced by this factory-method will retrieve your page content appropriately.
        Parameters:
        tc - This should be any of the pre-instantiated TextComparitor's. Again, a TextComparitor is just a String compare function like: equals, contains, StrCmpr.containsIgnoreCase(...), etc...
        cssClassCompareStrings - These are the values to be used by the TextComparitor when comparing with the value of the CSS-Selector "Class" from the list of DIV elements on the page.
        Returns:
        This returns an "Article Getter" that just picks out the part of a news-website article that lies between the HTML-DIV Element nodes whose class is identified by the "CSS (Cascading Style Sheets) 'class' identifier, and the TextComparitor parameter that you have chosen.
        Code:
        Exact Method Body:
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
         // FAIL-FAST: Check user-input for possible errors BEFORE building the Lambda-Function.
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
        
         // Check for valid compareStrings
         TCCompareStrException.check(cssClassCompareStrings);
        
         if (tc == null) throw new NullPointerException
             ("Null has been passed to TextComparitor Parameter 'tc', but this is not allowed here.");
        
         // This 'final' String is merely used for proper error reporting in any potential 
         // exception-messages, nothing else.
        
         final String functionNameStr =
             "InnerTagGetInclusive.first(page, \"div\", \"class\", " +
             STR_FORMAT_TC_PARAMS(tc, cssClassCompareStrings) + ")";
        
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
         // Build the instance, using a lambda-expression
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
        
         return (URL url, Vector<HTMLNode> page) ->
         {
             // This exception-check is done on every invocation of this Lambda-Function.
             // It is merely checking that these inputs are not-null, and page is of non-zero size.
             // ArticleGetException is a compile-time, checked exception.  It is important to halt
             // News-Site Scrape Progress when "Empty News-Page Data" is being passed here.
             // NOTE: This would imply an internal-error with class Download has occurred.
        
             ArticleGetException.check(url, page);
        
             Vector<HTMLNode> ret;
        
             try
             {
                 ret = InnerTagGetInclusive.first
                     (page, "div", "class", tc, cssClassCompareStrings);
             }
             catch (Exception e) 
             { 
                 throw new ArticleGetException
                     (ArticleGetException.GOT_EXCEPTION, functionNameStr, e);
             }
        
             // These error-checks are used to deduce whether the "Article Get" was successful.
             // When this exception is thrown, it means that the user-specified means of "Retrieving
             // an Article Body" FAILED.  In this case, the "innerHTML" of the specified htmltag and
             // class of the <DIV CLASS=...> produced a null news-article page, or an empty
             // news-article page.
        
             if (ret == null) throw new ArticleGetException
                 (ArticleGetException.RET_NULL, functionNameStr);
        
             if (ret.size() == 0) throw new ArticleGetException
                 (ArticleGetException.RET_EMPTY_VECTOR, functionNameStr);
        
             return ret;
         };
        
      • usual

        static ArticleGet usual​(java.lang.String htmlTag,
                                java.lang.String innerTag,
                                TextComparitor tc,
                                java.lang.String... attributeValueCompareStrings)
        This is a static, factory method for building ArticleGet.

        This gives more options for building your article getter. In almost 95% of news-websites, the article or page-body lies between an open and close HTML DIV element, and the <DIV CLASS="..."> can be found by the CSS 'class' attribute. This factory method, however, allows a programmer to handle the cases outside that 95%, by specifying the HTML-token, the attribute-name, and the usual TextComparitor to find the article.
        Parameters:
        htmlTag - This is almost always a "DIV" element, but if you wish to specify something else, possibly a paragraph element (<P>), or maybe an <IFRAME> or <FRAME>, then you may.
        innerTag - This is almost always a "CLASS" attribute, but if you need to use "ID" or something different altogether - possibly a site-specific tag, then use the innerTag / attribute-name of your choice.
        tc - This should be any of the pre-instantiated TextComparitor's. Again, a TextComparitor is just a String compare function like: equals, contains, StrCmpr.containsIgnoreCase(...).
        attributeValueCompareStrings - These are the Strings compared against the innerTag's value using the TextComparitor.
        Returns:
        This returns an "Article Getter" that picks out the part of a news-website article that lies between the HTML element which matches the 'htmlTag', 'innerTag' (id, class, or "other"), and whose attribute-value of the specified inner-tag can be matched by the TextComparitor and the compare-String's.
        Code:
        Exact Method Body:
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
         // FAIL-FAST: Check user-input for possible errors BEFORE building the Lambda-Function.
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
        
         TCCompareStrException.check(attributeValueCompareStrings);
        
         if (tc == null) throw new NullPointerException
             ("Null has been passed to TextComparitor Parameter 'tc', but this is not allowed here.");
        
         final String htmlTagLC  = htmlTag.toLowerCase();
         final String innerTagLC = innerTag.toLowerCase();
        
         // This 'final String' is merely used for proper error reporting in any potential
         // exception-messages, nothing else.
        
         final String functionNameStr =
             "InnerTagGetInclusive.first(page, \"" + htmlTag + "\", \"" + innerTag + "\", " +
             STR_FORMAT_TC_PARAMS(tc, attributeValueCompareStrings) + ")";
        
         // Check for valid HTML Tag.
         HTMLTokException.check(htmlTagLC);
        
         // Self-Closing / Singleton Tags CANNOT be used with INCLUSIVE Retrieval Operations.
         InclusiveException.check(htmlTagLC);
        
        
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
         // Build the instance, using a lambda-expression
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
        
         return (URL url, Vector<HTMLNode> page) ->
         {
             // This exception-check is done on every invocation of this Lambda-Function.
             // It is merely checking that these inputs are not-null, and page is of non-zero size.
             // ArticleGetException is a compile-time, checked exception.  It is important to halt
             // News-Site Scrape Progress when "Empty News-Page Data" is being passed here.
             // NOTE: This would imply an internal-error with class Download has occurred.
        
             ArticleGetException.check(url, page);   
        
             Vector<HTMLNode> ret;
        
             try
             { 
                 ret = InnerTagGetInclusive.first
                     (page, htmlTagLC, innerTagLC, tc, attributeValueCompareStrings);
             }
             catch (Exception e) // unlikely
             { 
                 throw new ArticleGetException
                     (ArticleGetException.GOT_EXCEPTION, functionNameStr, e);
             }
        
             // These error-checks are used to deduce whether the "Article Get" was successful.
             // When this exception is thrown, it means that the user-specified means of "Retrieving
             // an Article Body" FAILED.  In this case, the "innerHTML" of the specified htmlTag and
             // attribute produced a null news-article page, or an empty news-article page.
        
             if (ret == null) throw new ArticleGetException
                 (ArticleGetException.RET_NULL, functionNameStr);
        
             if (ret.size() == 0) throw new ArticleGetException
                 (ArticleGetException.RET_EMPTY_VECTOR, functionNameStr);
        
             return ret;
         };
        
      • usual

        static ArticleGet usual​(java.lang.String htmlTag,
                                java.lang.String innerTag,
                                java.util.regex.Pattern innerTagValuePattern)
        This is a static, factory method for building ArticleGet.

        This gives more options for building your article getter. In almost 95% of news-websites, the article or page-body lies between an open and close HTML DIV element, and the <DIV CLASS="..."> can be found by the CSS 'class' attribute. This factory method, however, allows a programmer to handle the cases outside that 95%. Here, you may specify the HTML-token, the attribute-name, and use a Java Regular-Expression to test the value of the attribute - no matter how complicated or bizarre.
        Parameters:
        htmlTag - This is almost always a "DIV" element, but if you wish to specify something else, possibly a paragraph element (<P>), or maybe an <IFRAME> or <FRAME>, then you may.
        innerTag - This is almost always a "CLASS" attribute, but if you need to use "ID" or something different altogether - possibly a site-specific tag, then use the innerTag / attribute-name of your choice.
        innerTagValuePattern - Any regular-expression. It will be used to PASS or FAIL the attribute-value (a name used interchangeably in this scrape/search package with "inner-tag-value") when compared against this regular-expression parameter.

        HELP: This would be like saying:
         // Pick some random HTML TagNode
         TagNode aTagNode        = (TagNode) page.elementAt(index_to_test);
        
         // Gets the attribute value of "innerTag"
         String  attributeValue  = aTagNode.AV(innerTag);
        
         // Make sure the HTML-token is as specified
         // calls to: java.util.regex.*;
         boolean passFail = aTagNode.tok.equals(htmlTag) &&
              innerTagValuePattern.matcher(attributeValue).find();
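
         As a concrete, self-contained illustration of the PASS/FAIL test above (the Pattern and attribute-values below are hypothetical examples, not taken from any real site):

         ```java
         import java.util.regex.Pattern;

         public class AttributePatternDemo {
             // The same kind of Pattern that could be passed as 'innerTagValuePattern'.
             static final Pattern P = Pattern.compile("article[-_]body");

             // PASS or FAIL an attribute-value, as the lambda built by this
             // factory-method would.
             static boolean passFail(String attributeValue) {
                 return P.matcher(attributeValue).find();
             }

             public static void main(String[] args) {
                 System.out.println(passFail("article-body main"));      // true
                 System.out.println(passFail("sidebar related-links"));  // false
             }
         }
         ```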
        
        Returns:
        This returns an "Article Getter" that picks out the part of a news-website article that lies between the HTML element which matches the htmlTag, innerTag and value-testing regex Pattern "innerTagValuePattern".
        Code:
        Exact Method Body:
         final String htmlTagLC  = htmlTag.toLowerCase();
         final String innerTagLC = innerTag.toLowerCase();
        
         // This 'final String' is merely used for proper error reporting in any potential
         // exception-messages, nothing else.
        
         final String functionNameStr =
             "InnerTagGetInclusive.first(page, \"" + htmlTag + "\", \"" + innerTag + "\", " +
             innerTagValuePattern.pattern() + ")";
        
        
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
         // FAIL-FAST: Check user-input for possible errors BEFORE building the Lambda-Function.
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
        
         HTMLTokException.check(htmlTagLC);
         InclusiveException.check(htmlTagLC);
        
        
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
         // Build the instance, using a lambda-expression
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
        
         return (URL url, Vector<HTMLNode> page) ->
         {
             // This exception-check is done on every invocation of this Lambda-Function.
             // It is merely checking that these inputs are not-null, and page is of non-zero size.
             // ArticleGetException is a compile-time, checked exception.  It is important to halt
             // News-Site Scrape Progress when "Empty News-Page Data" is being passed here.
             // NOTE: This would imply an internal-error with class Download has occurred.
        
             ArticleGetException.check(url, page);
        
             Vector<HTMLNode> ret;
        
             try
             { 
                 ret = InnerTagGetInclusive.first
                     (page, htmlTagLC, innerTagLC, innerTagValuePattern);
             }
             catch (Exception e) // unlikely
             { 
                 throw new ArticleGetException
                     (ArticleGetException.GOT_EXCEPTION, functionNameStr, e);
             }
        
             // These error-checks are used to deduce whether the "Article Get" was successful.
             // When this exception is thrown, it means that the user-specified means of "Retrieving
             // an Article Body" FAILED.  In this case, the "innerHTML" of the specified htmlTag and
             // attribute produced a null news-article page, or an empty news-article page.
        
             if (ret == null) throw new ArticleGetException
                 (ArticleGetException.RET_NULL, functionNameStr);
        
             if (ret.size() == 0) throw new ArticleGetException
                 (ArticleGetException.RET_EMPTY_VECTOR, functionNameStr);
        
             return ret;            
         };
        
      • usual

        static ArticleGet usual​(java.lang.String htmlTag,
                                java.lang.String innerTag,
                                java.util.function.Predicate<java.lang.String> p)
        This is a static, factory method for building ArticleGet.

        This gives more options for building your article getter. In almost 95% of news-websites, the article or page-body lies between an open and close HTML 'DIV' element, and the <DIV CLASS="..."> can be found by the CSS 'class' attribute. This factory method, however, allows a programmer to handle the cases outside that 95%, where you specify the HTML-token, the attribute-name, and a Predicate<String> for finding the page-body.
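
        A minimal, self-contained sketch of the kind of Predicate<String> this overload expects (the class and field names below are hypothetical, chosen only for illustration):

        ```java
        import java.util.function.Predicate;

        public class ClassAttributePredicateDemo {
            // A predicate like the one passed as parameter 'p': it receives the
            // attribute-value of the inner-tag and answers yes/no.
            static final Predicate<String> ARTICLE_CLASS =
                attributeValue -> attributeValue.toLowerCase().contains("article");

            public static void main(String[] args) {
                System.out.println(ARTICLE_CLASS.test("Article-Body"));  // true
                System.out.println(ARTICLE_CLASS.test("comments"));      // false
            }
        }
        ```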
        Parameters:
        htmlTag - This is almost always a "DIV" element, but if you wish to specify something else, possibly a paragraph element (<P>), or maybe an <IFRAME> or <FRAME>, then you may.
        innerTag - This is almost always a "CLASS" attribute, but if you need to use "ID" or something different altogether - possibly a site-specific tag, then use the innerTag / attribute-name of your choice.
        p - This java "lambda Predicate" will just receive the attribute-value from the "inner-tag" and provide a yes/no answer.
        Returns:
        This returns an "Article Getter" that matches an HTML element specified by 'htmlTag', 'innerTag' and the result of the String-Predicate parameter 'p' on the value of that inner-tag.
        Code:
        Exact Method Body:
         final String htmlTagLC  = htmlTag.toLowerCase();
         final String innerTagLC = innerTag.toLowerCase();
        
         // This 'final' String is merely used for proper error reporting in any potential
         // exception-messages, nothing else.
        
         final String functionNameStr =
             "InnerTagGetInclusive.first(page, \"" + htmlTag + "\", \"" + innerTag + "\", " +
             "Predicate<String>)";
        
        
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
         // FAIL-FAST: Check user-input for possible errors BEFORE building the Lambda-Function.
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
        
         HTMLTokException.check(htmlTagLC);
         InclusiveException.check(htmlTagLC);
        
         if (p == null) throw new NullPointerException
             ("Null has been passed to Predicate parameter 'p'.  This is not allowed here.");
        
        
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
         // Build the instance, using a lambda-expression
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
        
         return (URL url, Vector<HTMLNode> page) ->
         {
             // This exception-check is done on every invocation of this Lambda-Function.
             // It is merely checking that these inputs are not-null, and page is of non-zero size.
             // ArticleGetException is a compile-time, checked exception.  It is important to halt
             // News-Site Scrape Progress when "Empty News-Page Data" is being passed here.
             // NOTE: This would imply an internal-error with class Download has occurred.
        
             ArticleGetException.check(url, page);
        
             Vector<HTMLNode> ret;
        
             try
                 { ret = InnerTagGetInclusive.first(page, htmlTagLC, innerTagLC, p); }
        
             catch (Exception e)
             { 
                 throw new ArticleGetException
                     (ArticleGetException.GOT_EXCEPTION, functionNameStr, e);
             }
        
             // These error-checks are used to deduce whether the "Article Get" was successful.
             // When this exception is thrown, it means that the user-specified means of "Retrieving
             // an Article Body" FAILED.  In this case, the "innerHTML" of the specified htmlTag and
             // attribute produced a null news-article page, or an empty news-article page.
        
             if (ret == null) throw new ArticleGetException
                 (ArticleGetException.RET_NULL, functionNameStr, null);
        
             if (ret.size() == 0) throw new ArticleGetException
                 (ArticleGetException.RET_EMPTY_VECTOR, functionNameStr, null);
        
             return ret;
         };
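The per-invocation validate / extract / verify pattern used in the lambda above can be sketched in isolation with plain JDK types. The class and exception names below are illustrative, not part of the library:

```java
import java.util.List;
import java.util.function.Function;

public class SafeExtract {
    // Wraps an extraction function so that a thrown exception, a null
    // result, or an empty result each surface as a descriptive error,
    // mirroring the three checks performed in the lambda above.
    static <T, R> Function<T, List<R>> checked(
            Function<T, List<R>> extractor, String name) {
        // FAIL-FAST: reject a null extractor before building the lambda
        if (extractor == null)
            throw new NullPointerException("Null extractor: " + name);

        return (T input) -> {
            List<R> ret;

            try
                { ret = extractor.apply(input); }
            catch (Exception e)
                { throw new RuntimeException("Extraction failed in " + name, e); }

            if (ret == null)
                throw new RuntimeException(name + " returned null");
            if (ret.isEmpty())
                throw new RuntimeException(name + " returned an empty list");

            return ret;
        };
    }
}
```

The same division of labor appears throughout this class: cheap parameter checks run once, at factory time, while page-dependent checks run on every invocation of the returned lambda.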
        
      • usual

        🡅  🡇     🗕  🗗  🗖
        static ArticleGet usual​(java.lang.String startTextTag,
                                java.lang.String endTextTag)
        This is a static, factory method for building ArticleGet.

This factory method generates an "ArticleGet" that will retrieve news-article body-content based on a "start-tag" and an "end-tag." It is very important to note that the text can only match a single TextNode; it cannot span multiple TextNode's, nor be located inside a TagNode at all! Such text should be easy to find: print the vectorized-HTML page as a Vector, and inspect it!
        Parameters:
        startTextTag - This must be text from an HTML TextNode that is contained within one (single) TextNode of the vectorized-HTML page.
        endTextTag - This must be text from an HTML TextNode that is also contained in a single TextNode of the vectorized-HTML page.
        Returns:
        This will return an "Article Getter" that looks for non-HTML Text in the article, specified by the text-tag parameters, and gets it.
        Code:
        Exact Method Body:
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
         // FAIL-FAST: Check user-input for possible errors BEFORE building the Lambda-Function.
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
        
         if (startTextTag == null) throw new NullPointerException
             ("Null has been passed to parameter 'startTextTag', but this is not allowed here.");
        
         if (endTextTag == null) throw new NullPointerException
             ("Null has been passed to parameter 'endTextTag', but this is not allowed here.");
        
        
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
         // Build the instance, using a lambda-expression
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
        
         return (URL url, Vector<HTMLNode> page) ->
         {
             // This exception-check is done on every invocation of this Lambda-Function.
             // It is merely checking that these inputs are non-null, and page is of non-zero size.
             // ArticleGetException is a compile-time, checked exception.  It is important to halt
             // News-Site Scrape Progress when "Empty News-Page Data" is being passed here.
             // NOTE: This would imply an internal-error with class Download has occurred.
        
             ArticleGetException.check(url, page);
        
             int         start   = -1;
             int         end     = -1;
             HTMLNode    n       = null;
        
             while (++start < page.size())
                 if ((n = page.elementAt(start)) instanceof TextNode)
                     if (n.str.contains(startTextTag))
                         break;
        
             while (++end < page.size())
                 if ((n = page.elementAt(end)) instanceof TextNode)
                     if (n.str.contains(endTextTag))
                         break;
        
             // These error-checks are used to deduce whether the "Article Get" was successful.
             // When this exception is thrown, it means that the user-specified means of "Retrieving
             // an Article Body" FAILED.  In this case it is because the start/end tags were not found
             // in the text of the vectorized-html news-article web-page.
        
             if (start == page.size()) throw new ArticleGetException(
                 "Start Text Tag [" + startTextTag + "], was not found on the News Article HTML " +
                 "page."
             );
        
             if (end == page.size()) throw new ArticleGetException(
                 "End Text Tag [" + endTextTag + "], was not found on the News Article HTML " +
                 "page."
             );
        
             return Util.cloneRange(page, start, end + 1);
         };
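The two marker-scanning loops above amount to: find the first node containing the start-text, find the first node containing the end-text, and return the inclusive range between them. A minimal, self-contained sketch of that logic, using plain Strings in place of TextNode's (all names here are illustrative, not part of the library):

```java
import java.util.List;

public class TextRangeScan {
    // Returns the inclusive sub-range of 'nodes' delimited by the first
    // element containing 'startTag' and the first containing 'endTag'.
    // Returns null if either marker is absent (the library throws an
    // ArticleGetException in that situation instead).
    static List<String> range(List<String> nodes, String startTag, String endTag) {
        int start = -1, end = -1;

        for (int i = 0; i < nodes.size(); i++)
            if (nodes.get(i).contains(startTag)) { start = i; break; }

        for (int i = 0; i < nodes.size(); i++)
            if (nodes.get(i).contains(endTag)) { end = i; break; }

        if (start == -1 || end == -1) return null;

        // 'end + 1' makes the end-marker node itself part of the result
        return nodes.subList(start, end + 1);
    }
}
```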
        
      • usual

        🡅  🡇     🗕  🗗  🗖
        static ArticleGet usual​(java.util.regex.Pattern startPattern,
                                java.util.regex.Pattern endPattern)
This is a static, factory method for building ArticleGet. This factory method generates an "ArticleGet" that will retrieve news-article body-content based on starting and ending regular-expressions. The matches performed by the Regular-Expression checker are performed against TextNode's - not against TagNode's, nor against the page as a whole. It is very important to note that the text can only match a single TextNode; it cannot span multiple TextNode's, nor be located inside a TagNode at all! Such text should be easy to find: print the vectorized-HTML page as a Vector, and inspect it!
        Parameters:
        startPattern - This must be a regular expression Pattern that matches an HTML TextNode that is contained within one (single) TextNode of the vectorized-HTML page.
        endPattern - This must be a regular expression Pattern that matches an HTML TextNode that is also contained in a single TextNode of the vectorized-HTML page.
        Returns:
        This will return an "Article Getter" that looks for non-HTML Text in the article, specified by the regular-expression pattern-matching parameters, and gets it.
        Code:
        Exact Method Body:
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
         // FAIL-FAST: Check user-input for possible errors BEFORE building the Lambda-Function.
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
        
         if (startPattern == null) throw new NullPointerException
             ("Null has been passed to parameter 'startPattern', but this is not allowed here.");
        
         if (endPattern == null) throw new NullPointerException
             ("Null has been passed to parameter 'endPattern', but this is not allowed here.");
        
        
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
         // Build the instance, using a lambda-expression
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
        
         return (URL url, Vector<HTMLNode> page) ->
         {
             // This exception-check is done on every invocation of this Lambda-Function.
             // It is merely checking that these inputs are non-null, and page is of non-zero size.
             // ArticleGetException is a compile-time, checked exception.  It is important to halt
             // News-Site Scrape Progress when "Empty News-Page Data" is being passed here.
             // NOTE: This would imply an internal-error with class Download has occurred.
        
             ArticleGetException.check(url, page);
             int         start   = -1;
             int         end     = -1;
             HTMLNode    n       = null;
        
             while (++start < page.size())
                 if ((n = page.elementAt(start)) instanceof TextNode)
                     if (startPattern.matcher(n.str).find())
                         break;
        
             while (++end < page.size())
                 if ((n = page.elementAt(end)) instanceof TextNode)
                     if (endPattern.matcher(n.str).find())
                         break;
        
             // These error-checks are used to deduce whether the "Article Get" was successful.
             // When this exception is thrown, it means that the user-specified means of "Retrieving
             // an Article Body" FAILED.  In this case it is because the start or end regex failed to
             // match.
        
             if (start == page.size()) throw new ArticleGetException(
                 "Start Pattern [" + startPattern.toString() + "], was not found on the HTML " +
                 "page."
             );
        
             if (end == page.size()) throw new ArticleGetException
                 ("End Pattern [" + endPattern.toString() + "], was not found on the HTML page.");
        
             return Util.cloneRange(page, start, end + 1);
         };
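Note that the lambda above uses Matcher.find(), not Matcher.matches(), so the supplied Patterns only need to match a substring of a TextNode, not the entire node. A small self-contained illustration of that distinction (the byline pattern is just an example):

```java
import java.util.regex.Pattern;

public class FindDemo {
    // Pattern.matcher(text).find() succeeds on any substring match,
    // which is why a short pattern is enough to locate a TextNode.
    static boolean containsMatch(Pattern p, String textNode) {
        return p.matcher(textNode).find();
    }
}
```

Anchored patterns (using '^' and '$') still constrain the match to the whole TextNode, so they behave more like Matcher.matches().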
        
      • branch

        🡅  🡇     🗕  🗗  🗖
        static ArticleGet branch​(URLFilter[] urlSelectors,
                                 ArticleGet[] getters)
        This is a static, factory method for building ArticleGet. This is just a way to put a list of article-parse objects into a single "branching" article-parse Object. The two parameters must be equal-length arrays, with non-null elements. Each 'urlSelector' will be tested, and when a selector passes, the ArticleGet that is created will use the "parallel getter" from the parallel array "getters."

        LAY-SPEAK: In short, a programmer who uses the NewsSiteScrape class to scrape a site that carries different types of news-articles will need differing "ArticleGet" methods. This method takes two parallel arrays, matching the URL from which an article was retrieved to the particular "getter" method provided for it. For example, scraping the address http://www.baidu.com/ - a Chinese news web-site - produces links to at least three primary domains:

        1. http://...chinesenews.com/director.../article...
        2. http://...xinhuanet.com/director.../article...
        3. http://...cctv.com/director.../article...

        Results from each of these sites need to be "handled" just ever-so-slightly differently.
        Parameters:
        urlSelectors - This is a list of Predicate<URL> elements. When one of these returns TRUE for a particular URL, then the index of that URL-selector in its array will be used to call the appropriate getter from the parallel-array input-parameter 'getters'.
        getters - This is a list of getter elements. These should be tailored to the particular news-website sources that are chosen/selected by the 'urlSelectors' parallel array.
        Returns:
        This will be a "master ArticleGet" or a "dispatch ArticleGet." All it does is simply traverse the first array looking for a Predicate-match from the 'urlSelectors', and then calls the getter in the parallel array.

        NOTE: If none of the 'urlSelectors' match when this "dispatch" (or rather, "branch") is called by class NewsSiteScrape, the function/getter that is returned will throw an ArticleGetException. It is important that the programmer only allow article URL's that can be capably handled to pass to class NewsSiteScrape.
        Throws:
        java.lang.IllegalArgumentException - Will throw this exception if:

        • Either of these parameters are null
        • If they are not parallel (i.e. the arrays have differing lengths).
        • If either contain a null value.
        Code:
        Exact Method Body:
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
         // FAIL-FAST: Check user-input for possible errors BEFORE building the Lambda-Function.
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
        
         if (urlSelectors.length == 0) throw new IllegalArgumentException
             ("parameter 'urlSelectors' had zero-elements.");
        
         if (getters.length == 0) throw new IllegalArgumentException
             ("parameter 'getters' had zero-elements.");
        
         ParallelArrayException.check(urlSelectors, "urlSelectors", true, getters, "getters", true);
        
        
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
         // Build the instance, using a lambda-expression
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
        
         return (URL url, Vector<HTMLNode> page) ->
         {
             for (int i=0; i < urlSelectors.length; i++)
                 if (urlSelectors[i].test(url))
                     return getters[i].apply(url, page);
        
             throw new ArticleGetException(
                 "None of the urlSelectors you have provided matched the URL sent to this " +
                 "instance of ArticleGet."
             );
         };
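The first-match-wins dispatch in the lambda above can be sketched with plain java.util.function types. Here Strings stand in for the URL and the extracted page, and the routing table is hypothetical, not part of the library:

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

public class Dispatch {
    // Tests each selector in order; on the first match, delegates to the
    // handler at the same index, just as branch() pairs its two arrays.
    static String dispatch(List<Predicate<String>> selectors,
                           List<UnaryOperator<String>> handlers,
                           String url) {
        for (int i = 0; i < selectors.size(); i++)
            if (selectors.get(i).test(url))
                return handlers.get(i).apply(url);

        // Mirrors the ArticleGetException thrown when nothing matches
        throw new IllegalStateException("No selector matched: " + url);
    }

    // Hypothetical routing table, keyed on host-name substrings
    static String route(String url) {
        List<Predicate<String>> selectors = List.of(
            u -> u.contains("xinhuanet"),
            u -> u.contains("cctv"));
        List<UnaryOperator<String>> handlers = List.of(
            u -> "XINHUA",
            u -> "CCTV");
        return dispatch(selectors, handlers, url);
    }
}
```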
        
      • andThen

        🡅  🡇     🗕  🗗  🗖
        default ArticleGet andThen​(ArticleGet after)
        This is the standard-java Function 'andThen' method.
        Parameters:
        after - This is the ArticleGet that will be (automatically) applied after 'this' function.
        Returns:
        A new, composite ArticleGet that performs both operations. It will:

        1. Run 'this' function's 'apply' method to a URL, Vector<HTMLNode>, and return a Vector<HTMLNode>.

        2. Then it will run the 'after' function's 'apply' method to the results of 'this.apply(...)' and return the result.
        Code:
        Exact Method Body:
         return (URL url, Vector<HTMLNode> page) -> after.apply(url, this.apply(url, page));
        
      • compose

        🡅  🡇     🗕  🗗  🗖
        default ArticleGet compose​(ArticleGet before)
        This is the standard-java Function 'compose' method.
        Parameters:
        before - This is the ArticleGet that is performed first, whose results are sent to 'this' function.
        Returns:
        A new composite ArticleGet that performs both operations. It will:

        1. Run the 'before' function's 'apply' method to a URL, Vector<HTMLNode>, and return a Vector<HTMLNode>.
        2. Then it will run 'this' function's 'apply' method to the results of the before.apply(...) and return the result.
        Code:
        Exact Method Body:
         return (URL url, Vector<HTMLNode> page) -> this.apply(url, before.apply(url, page));
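The ordering difference between 'andThen' and 'compose' follows the standard java.util.function.Function contract, which the following self-contained sketch demonstrates with two simple String functions:

```java
import java.util.function.Function;

public class ComposeDemo {
    static final Function<String, String> wrap = s -> "[" + s + "]";
    static final Function<String, String> bang = s -> s + "!";

    // wrap.andThen(bang): 'wrap' runs first, then 'bang' on its result
    static String andThenDemo(String s) { return wrap.andThen(bang).apply(s); }

    // wrap.compose(bang): 'bang' runs first, then 'wrap' on its result
    static String composeDemo(String s) { return wrap.compose(bang).apply(s); }
}
```

ArticleGet's versions behave the same way, except that the URL parameter is threaded, unchanged, into both stages.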
        
      • identity

        🡅  🡇     🗕  🗗  🗖
        static ArticleGet identity()
        The identity function always returns, as output, the same Vector<HTMLNode> that it receives as input. This mirrors Java's standard Function.identity() method.
        Returns:
        a new ArticleGet which behaves as an identity function of type: java.util.function.Function<Vector<HTMLNode>, Vector<HTMLNode>>

        ... where the returned Vector is always identical to the input Vector.
        Code:
        Exact Method Body:
         return (URL url, Vector<HTMLNode> page) ->
         {
             ArticleGetException.check(url, page);
             return page;
         };