Package Torello.HTML.Tools.NewsSite
Interface LinksGet
- All Superinterfaces:
  java.util.function.BiFunction<java.net.URL,java.util.Vector<HTMLNode>,java.util.Vector<java.lang.String>>, java.io.Serializable
- Functional Interface:
- This is a functional interface and can therefore be used as the assignment target for a lambda expression or method reference.
@FunctionalInterface public interface LinksGet extends java.util.function.BiFunction<java.net.URL,java.util.Vector<HTMLNode>,java.util.Vector<java.lang.String>>, java.io.Serializable
This function-pointer / lambda-target interface facilitates extracting news-article URL's from the main page (or a sub-section page) of a news-media web-site.
When class ScrapeURLs is asked to retrieve all Newspaper Web-Site Article URL's, it can do so with the help of an instance of class 'LinksGet'. Passing a non-null reference of this class is not mandatory, but it can help when it is difficult to separate Article-Links from Advertisement-Links (and other extraneous links).
If null is passed to ScrapeURLs when scraping a News-Oriented Page, then that class will simply return & scrape all URL's that are extracted from the page. Again, it isn't mandatory to pass a non-null 'LinksGet' reference when scraping, but if a developer decides not to, some of the downloaded content may not actually be articles.
Remember, it is also alright to pass a non-null URLFilter to class ScrapeURLs to ensure that advertisements and other off-topic pages are avoided.
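The mechanics described above can be sketched in plain JDK Java. Note that 'SimpleLinksGet' and 'Node' below are hypothetical stand-ins for this library's LinksGet and HTMLNode classes; the sketch only shows how an interface extending BiFunction and Serializable becomes the assignment-target for a lambda:

```java
import java.io.Serializable;
import java.net.URL;
import java.util.Vector;
import java.util.function.BiFunction;

public class LinksGetSketch
{
    // Hypothetical stand-in for the library's HTMLNode: just exposes an HREF value.
    interface Node { String href(); }

    // Mirrors the LinksGet declaration above: a @FunctionalInterface that
    // extends both BiFunction and Serializable.
    @FunctionalInterface
    interface SimpleLinksGet
        extends BiFunction<URL, Vector<Node>, Vector<String>>, Serializable
    { }

    public static void main(String[] args) throws Exception
    {
        // The lambda body becomes the implementation of the inherited "apply" method.
        SimpleLinksGet getter = (URL url, Vector<Node> page) ->
        {
            Vector<String> ret = new Vector<>();
            for (Node n : page)
                if (n.href() != null) ret.add(n.href());
            return ret;
        };

        Vector<Node> page = new Vector<>();
        page.add(() -> "https://example.com/article-1");
        page.add(() -> null);                            // e.g. an anchor with no HREF
        page.add(() -> "https://example.com/article-2");

        Vector<String> links = getter.apply(new URL("https://example.com/"), page);
        System.out.println(links.size());  // prints 2
    }
}
```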
Best Use Case:
Sites in which Article-URL's are located in a very well-defined and specific area ("section") of the page are the best type of site for using a LinksGet instance. Just select some NodeSearch-Package tools to extract those areas and locations on the page where Article-URL's are located, and return them to the Scraper.
There are two examples of LinksGet, below, used for News-Site Scraping.
Unlike the Functional-Interface 'ArticleGet', this class does not provide any standard Factory-Methods for generating instances of LinksGet. The example below is used for scraping the Spanish (from Spain, not Mexico) news-site "ABC.ES". The Java Lambda-Expression Syntax '->' is used to construct this Functional Interface:
Example:
final LinksGet ABC_LINKS_GETTER = (URL url, Vector<HTMLNode> page) ->
{
    Vector<String>  ret = new Vector<>();
    TagNode         tn;
    String          urlStr;

    // On the Spanish-Language Internet News-Site "http://abc.es/", all article-url's found on the
    // section-pages are "wrapped" inside an HTML Element <ARTICLE> ... </ARTICLE> wrapper.
    //
    // To retrieve these URL's, just search for the "Inclusive HTML" of all "<ARTICLE>" elements,
    // and then retrieve the first HTML Anchor '<A HREF=...> ... </A>' URL-Link.  The URL String
    // would be the value of that Anchor-Tag's HREF-Attribute.

    for (DotPair article : TagNodeFindL1Inclusive.all(page, "article"))

        if ((tn = TagNodeGet.first(page, article.start, article.end, TC.OpeningTags, "a")) != null)

            if ((urlStr = tn.AV("href")) != null)
                ret.add(urlStr);

    // Return the list of links
    return ret;
};
This example is used for scraping the Chinese Government Web-Site 'www.Gov.CN'. The Java Lambda-Expression Syntax '->' is used to construct this Functional Interface:
Example:
final LinksGet GOVCN_CAROUSEL_LINKS_GETTER = (URL url, Vector<HTMLNode> page) ->
{
    Vector<String>  ret = new Vector<>();
    String          urlStr;

    // As of Fall 2019, the Chinese Government Internet News Site "GOV.CN" has a small Java-Script
    // "Carousel" where a series of 5 or 6 news-articles may be found.  To retrieve these URL's,
    // just use the NodeSearch package class "Get Inclusive" (by inner-tag value).  The carousel is
    // defined in an HTML divider element ('<DIV CLASS='slider-carousel'> ... </DIV>'), so retrieve
    // that divider, and then get all of the HTML Anchor Elements ('<A HREF=...> ... </A>'), and
    // retrieve the HREF URL String.

    Vector<HTMLNode> carouselDIV = InnerTagGetInclusive.first
        (page, "div", "class", TextComparitor.C, "slider-carousel");

    for (TagNode tn : TagNodeGet.all(carouselDIV, TC.OpeningTags, "a"))
        if ((urlStr = tn.AV("href")) != null)
            ret.add(urlStr);

    // Return the list of links
    return ret;
};
Hi-Lited Source-Code:
- View Here: Torello/HTML/Tools/NewsSite/LinksGet.java
- Open New Browser-Tab: Torello/HTML/Tools/NewsSite/LinksGet.java
File Size: 1,215 Bytes; Line Count: 35 '\n' Characters Found
Field Summary
Serializable ID:
Modifier and Type    Field
static long          serialVersionUID
Field Detail
-
serialVersionUID
static final long serialVersionUID
This fulfils the SerialVersionUID requirement for all classes that implement Java's interface java.io.Serializable. Using the Serializable implementation offered by Java is very easy, and can make saving program-state while debugging a lot easier. It can also be used in place of more complicated systems like "hibernate" to store data as well.
Functional Interfaces are usually not thought of as Data Objects that need to be saved, stored and retrieved; however, having the ability to store intermediate results along with the lambda-functions that helped generate those results can make debugging easier.
- See Also:
- Constant Field Values
- Code:
- Exact Field Declaration Expression:
public static final long serialVersionUID = 1;
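As a sketch of why extending java.io.Serializable matters for a lambda-target interface, the plain-JDK example below (the 'SerFunction' interface is a hypothetical stand-in, not part of this library) serializes a lambda to a byte array and reads it back:

```java
import java.io.*;
import java.util.function.Function;

public class SerializableLambdaDemo
{
    // A Serializable functional interface, mirroring how LinksGet gains
    // serializability by extending java.io.Serializable.
    @FunctionalInterface
    interface SerFunction<T, R> extends Function<T, R>, Serializable
    {
        long serialVersionUID = 1;  // interface constants are implicitly static final
    }

    public static void main(String[] args) throws Exception
    {
        SerFunction<String, Integer> f = String::length;

        // Write the lambda to a byte array...
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos))
            { oos.writeObject(f); }

        // ...and read it back.  This round-trip only works because the
        // lambda's target interface is Serializable.
        try (ObjectInputStream ois =
                new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray())))
        {
            @SuppressWarnings("unchecked")
            SerFunction<String, Integer> g = (SerFunction<String, Integer>) ois.readObject();
            System.out.println(g.apply("LinksGet"));  // prints 8
        }
    }
}
```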
-
-
Method Detail
-
apply
java.util.Vector<java.lang.String> apply(java.net.URL url, java.util.Vector<HTMLNode> page)
FunctionalInterface Target-Method:
This method corresponds to the @FunctionalInterface Annotation's method requirement. It is the only non-default, non-static method in this interface, and may be the target of a Lambda-Expression or a '::' (double-colon) Function-Pointer.
The purpose of this method is to retrieve all of the relevant HTML Anchor Elements from a news web-site.
- Specified by:
apply
in interface java.util.function.BiFunction<java.net.URL,java.util.Vector<HTMLNode>,java.util.Vector<java.lang.String>>
- Parameters:
url - The URL of a section of a newspaper, or content, web-site.
page - The download of that URL into a vectorized-html page.
- Returns:
A list of the URL-link Strings retrieved from the relevant TagNode's on the page.
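To illustrate the '::' (double-colon) Function-Pointer usage mentioned above, here is a minimal plain-JDK sketch. The method 'getAllLinks' and its String-based 'page' parameter are hypothetical simplifications (the real interface takes a Vector<HTMLNode>); the point is only that any method matching BiFunction's apply signature can serve as the target:

```java
import java.net.URL;
import java.util.Vector;
import java.util.function.BiFunction;

public class MethodRefDemo
{
    // Hypothetical stand-in: "page" is just a Vector of href Strings here,
    // instead of the library's Vector<HTMLNode>.
    static Vector<String> getAllLinks(URL url, Vector<String> page)
    {
        Vector<String> ret = new Vector<>();
        for (String href : page)
            if (href.startsWith("http")) ret.add(href);
        return ret;
    }

    public static void main(String[] args) throws Exception
    {
        // A method reference, rather than a lambda, as the functional-interface target.
        BiFunction<URL, Vector<String>, Vector<String>> getter = MethodRefDemo::getAllLinks;

        Vector<String> page = new Vector<>();
        page.add("http://example.com/a");
        page.add("javascript:void(0)");     // non-http links are filtered out
        page.add("https://example.com/b");

        System.out.println(getter.apply(new URL("http://example.com/"), page));
        // prints [http://example.com/a, https://example.com/b]
    }
}
```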