Interface LinksGet

  • All Superinterfaces:
    java.util.function.BiFunction<java.net.URL,java.util.Vector<HTMLNode>,java.util.Vector<java.lang.String>>, java.io.Serializable
    Functional Interface:
    This is a functional interface and can therefore be used as the assignment target for a lambda expression or method reference.

    @FunctionalInterface
    public interface LinksGet
    extends java.util.function.BiFunction<java.net.URL,java.util.Vector<HTMLNode>,java.util.Vector<java.lang.String>>, java.io.Serializable
    A function-pointer / lambda-target interface that facilitates extracting news-article URL's from the main-page (or a sub-section) of a news-media web-site.

    When the class ScrapeURLs is asked to retrieve all news-paper web-site Article URL's, it can do so using an instance of this interface 'LinksGet' that has been passed to it. Passing a non-null LinksGet reference is not mandatory, but it can help on news-sites where it is hard to identify which URL's point to newspaper articles and which URL's point to advertisements or other extraneous, non-news locations. If no LinksGet instance is passed to the ScrapeURLs class, then the class will retrieve all URL's found on the page. Even if 'extraneous' links are retrieved, the programmer may still pass a URLFilter to ScrapeURLs to ensure that advertisements and other off-topic pages are avoided.
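    The selection logic described above can be sketched roughly as follows. This is only an illustration and not the implementation inside ScrapeURLs: the class name LinkSelectionSketch and the method select are hypothetical, the page is simplified to a Vector of HREF strings so that no HTML-parsing classes are needed, and a plain java.util.function.Predicate stands in for URLFilter, with 'true' assumed to mean "keep this link".

```java
import java.util.Vector;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch: none of these names belong to the library
class LinkSelectionSketch
{
    // allHrefs : every link found on the page (stand-in for the vectorized-html page)
    // getter   : plays the role of a LinksGet instance; may be null
    // filter   : plays the role of a URLFilter; may be null.
    //            ASSUMPTION: 'true' means the link is kept
    static Vector<String> select(
            Vector<String> allHrefs,
            Function<Vector<String>, Vector<String>> getter,
            Predicate<String> filter
        )
    {
        // No getter supplied: fall back to every link found on the page
        Vector<String> candidates = (getter == null) ? allHrefs : getter.apply(allHrefs);

        // The optional filter is applied whether or not a getter was supplied
        Vector<String> ret = new Vector<>();

        for (String href : candidates)
            if ((filter == null) || filter.test(href))
                ret.add(href);

        return ret;
    }
}
```

    Calling select(allHrefs, null, null) returns every link unchanged, while supplying only a filter narrows the complete list. That is the sense in which the paragraph above treats the getter as optional.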

    PRIMARY USE: Sites in which Article URL's are located in very well understood and specified areas (or, rather, "sections") of the news-site main-pages should make use of this interface. The URLFilter mechanism can require that only Article URL's that match certain regular-expressions will pass the Article URL scrape-logic. LinksGet, by contrast, allows a user to specify the areas and locations on the page in which to find the links, regardless of the structure or properties of the web-page itself. There are two examples of links-getters, below, used for news-site scraping.

    Unlike the functional-interface 'ArticleGet', this interface does not provide any simple or straight-forward factory-methods for generating instances of LinksGet. In fact, even using this interface might seem redundant in the parameter list, since the parameter "URLFilter" can already filter which URL's are included and which are not. Knowing how every news-site on the internet functions is beyond the scope of this project, so this interface shall remain intact and included among the parameters to the method ScrapeURLs.get(...), even though it is usually easier to use one of the factory instances of URLFilter instead.

    This example is used for scraping the Spanish (from Spain, not Mexico) news-site "ABC.ES." The Java Lambda Syntax -> is used to construct this Functional Interface:

    Example:
    final LinksGet ABC_LINKS_GETTER = (URL url, Vector<HTMLNode> page) ->
    {
        Vector<String> ret = new Vector<>();       TagNode tn;     String urlStr;
    
        // On the Spanish-Language Internet News-Site "http://abc.es/" all article-url's found on the section-pages
        // are "wrapped" inside an HTML Element <ARTICLE> ... </ARTICLE> wrapper.
        // To retrieve these URL's, just search for the "Inclusive HTML" of all "<ARTICLE>" elements, and then
        // retrieve the first HTML Anchor '<A HREF=...> ... </A>' url-link.  The URL String would be
        // the value of the 'HREF' attribute / inner-tag in that anchor.
    
        for (DotPair article : TagNodeFindL1Inclusive.all(page, "article"))
            if ((tn = TagNodeGet.first(page, article.start, article.end, TC.OpeningTags, "a")) != null)
                if ((urlStr = tn.AV("href")) != null)
                    ret.add(urlStr);
    
        // Return the list of links
        return ret;
    };
    


    This example is used for scraping the Chinese Government Website 'www.Gov.CN'. The Java Lambda-Expression Syntax -> is used to construct this Functional Interface:

    Example:
    final LinksGet GOVCN_CAROUSEL_LINKS_GETTER = (URL url, Vector<HTMLNode> page) ->
    {
        Vector<String> ret = new Vector<>();        String urlStr;
    
        // As of Fall 2019, the Chinese Government Internet News Site "GOV.CN" has a small Java-Script
        // "Carousel" where a series of 5 or 6 news-articles may be found.  To retrieve these URL's, just 
        // use the NodeSearch package class "Get Inclusive" (by inner-tag value).  The carousel is defined
        // in an HTML divider element ('<DIV CLASS='slider-carousel'> ... </DIV>'), so retrieve that divider,
        // and then get all of the HTML Anchor Elements ('<A HREF=...> ... </A>'), and retrieve the HREF
        // URL String.
    
        Vector<HTMLNode> carouselDIV = InnerTagGetInclusive.first
            (page, "div", "class", TextComparitor.C, "slider-carousel");
    
        for (TagNode tn: TagNodeGet.all(carouselDIV, TC.OpeningTags, "a"))
            if ((urlStr = tn.AV("href")) != null)
                ret.add(urlStr);
    
        // Return the list of links
        return ret;
    };
    


    • Field Summary

       
      Serializable ID
      Modifier and Type Field
      static long serialVersionUID
    • Method Summary

       
      @FunctionalInterface (Lambda) Method
      Modifier and Type Method
      Vector<String> apply(URL url, Vector<HTMLNode> page)
      • Methods inherited from interface java.util.function.BiFunction

        andThen
    • Field Detail

      • serialVersionUID

        static final long serialVersionUID
        This fulfils the SerialVersion UID requirement for all classes that implement Java's interface java.io.Serializable. Using the Serializable implementation offered by Java is very easy, and can make saving program state when debugging a lot easier. It can also be used in place of more complicated systems like "Hibernate" to store data.

        Functional Interfaces are usually not thought of as Data Objects that need to be saved, stored and retrieved; however, having the ability to store intermediate results along with the lambda-functions that helped get those results can make debugging easier.
        See Also:
        Constant Field Values
        Code:
        Exact Field Declaration Expression:
         public static final long serialVersionUID = 1;
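        The idea of storing a lambda alongside intermediate results can be illustrated with a small round trip through Java's standard object streams. The names SerializableLinksGetter and LambdaRoundTripSketch are hypothetical stand-ins, and the node type is simplified to String so that the sketch compiles without the HTML classes. It relies only on the standard rule that a lambda is serializable when its target interface extends java.io.Serializable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.net.URI;
import java.net.URL;
import java.util.Vector;
import java.util.function.BiFunction;

// Hypothetical stand-in for LinksGet, with the page simplified to Vector<String>
interface SerializableLinksGetter
    extends BiFunction<URL, Vector<String>, Vector<String>>, Serializable
{ }

class LambdaRoundTripSketch
{
    // Writes the getter to an in-memory byte array, then reads it back
    static SerializableLinksGetter roundTrip(SerializableLinksGetter getter)
    {
        try
        {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();

            try (ObjectOutputStream out = new ObjectOutputStream(bytes))
                { out.writeObject(getter); }

            try (ObjectInputStream in = new ObjectInputStream
                    (new ByteArrayInputStream(bytes.toByteArray())))
                { return (SerializableLinksGetter) in.readObject(); }
        }
        catch (IOException | ClassNotFoundException e)
            { throw new IllegalStateException(e); }
    }

    // Builds a getter, sends it through the round trip, and applies the restored copy
    static Vector<String> demo()
    {
        SerializableLinksGetter keepHttps = (url, page) ->
        {
            Vector<String> ret = new Vector<>();
            for (String s : page) if (s.startsWith("https://")) ret.add(s);
            return ret;
        };

        Vector<String> page = new Vector<>();
        page.add("https://example.com/a");
        page.add("ftp://example.com/b");

        try
            { return roundTrip(keepHttps).apply(URI.create("https://example.com/").toURL(), page); }
        catch (IOException e)
            { throw new IllegalStateException(e); }
    }
}
```

        A restored lambda can only be deserialized while the class that originally defined it is still available on the class-path, so this approach suits debugging sessions better than long-term storage.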
        
    • Method Detail

      • apply

        java.util.Vector<java.lang.String> apply(java.net.URL url,
                                                 java.util.Vector<HTMLNode> page)
        FUNCTIONAL-INTERFACE METHOD: This is the single abstract method of this functional-interface. Its purpose is to retrieve the links held by all of the relevant HTML Anchor Elements on a news-website page.
        Specified by:
        apply in interface java.util.function.BiFunction<java.net.URL,java.util.Vector<HTMLNode>,java.util.Vector<java.lang.String>>
        Parameters:
        url - The URL of a section of a newspaper, or content, website.
        page - The download of that URL into a vectorized-html page.
        Returns:
        A list of the URL's, as Strings, taken from the TagNode's that have relevant URL-link information.
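        As a rough illustration of how the two parameters and the inherited andThen method might be used together, the sketch below resolves each HREF against the section URL and then removes duplicates. The class ApplySketch and its members are hypothetical, and the page is simplified to a Vector of raw HREF strings rather than a Vector<HTMLNode>. Whether ScrapeURLs resolves relative links on its own is not stated here, so the resolution step is only an assumption about one possible use of the url parameter.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.LinkedHashSet;
import java.util.Vector;
import java.util.function.BiFunction;

// Hypothetical sketch: the page is a Vector of raw HREF values, not HTMLNode's
class ApplySketch
{
    // Resolves every HREF against the URL of the section page it came from
    static final BiFunction<URL, Vector<String>, Vector<String>> RESOLVE = (url, page) ->
    {
        Vector<String> ret = new Vector<>();

        for (String href : page)
            try
                { ret.add(url.toURI().resolve(href).toString()); }
            catch (URISyntaxException | IllegalArgumentException e)
                { /* skip links that cannot be parsed */ }

        return ret;
    };

    // andThen is inherited from BiFunction; here it post-processes the returned
    // Vector, dropping duplicates while keeping first-seen order
    static final BiFunction<URL, Vector<String>, Vector<String>> RESOLVE_UNIQUE =
        RESOLVE.andThen(links -> new Vector<String>(new LinkedHashSet<String>(links)));

    // Applies the composed function to a small, made-up list of links
    static Vector<String> demo()
    {
        Vector<String> page = new Vector<>();
        page.add("/a/1.html");
        page.add("b.html");
        page.add("/a/1.html");

        try
        {
            URL section = URI.create("https://example.com/news/index.html").toURL();
            return RESOLVE_UNIQUE.apply(section, page);
        }
        catch (IOException e)
            { throw new IllegalStateException(e); }
    }
}
```

        Note that the value returned by andThen is a plain BiFunction rather than a LinksGet, so it would need to be wrapped in a lambda or method reference before being passed where a LinksGet is expected.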