Package Torello.HTML.Tools.NewsSite
Class NewsSites
- java.lang.Object
  - Torello.HTML.Tools.NewsSite.NewsSites
public class NewsSites extends java.lang.Object
This class is nothing more than an 'Example Class' that contains configurations for several foreign-language news web-sites, from both overseas and Latin America.
This class provides five example News Websites with all of the necessary configurations that would be passed to ScrapeURLs, and (subsequently) ScrapeArticles.
The following news-oriented web-sites are provided in this "example" (of sorts) class:
- https://abc.es
- https://elnacional.com
- https://elespectador.com
- https://www.gov.cn
- https://elpulso.mx
Side Note: Scraping major Associated Press news-sites such as Fox-News, CNN, MSNBC, and Yahoo! News is not a problem for this software - although taking both spiritual and moral stances against the terror that these organizations have caused the world is largely the driving force behind wanting to scrape foreign news sites.
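Below is a condensed sketch of how one of these NewsSite constants flows through ScrapeURLs and then ScrapeArticles. It is abbreviated from the runExample() method documented at the bottom of this page; the import statements are a best guess at the package locations, and the directory names are simply the ones that example uses.

import Torello.HTML.Tools.NewsSite.*;  // NewsSite, ScrapeURLs, ScrapeArticles, ...
import Torello.Java.*;                 // (assumed location of StorageWriter)
import java.util.Vector;

// Pick one of the NewsSite constants defined in this class
StorageWriter log = new StorageWriter();
NewsSite      ns  = NewsSites.GovCNCarousel;

// Step 1: Visit each of the site's section-pages and collect the Article URL's
Vector<Vector<String>> articleURLs = ScrapeURLs.get(ns, log);

// Step 2: Download each Article-Body, saving the results to the file-system
ScrapedArticleReceiver receiver = ScrapedArticleReceiver.saveToFS("cnb/articleData/");
Pause                  pause    = Pause.getFSInstance("cnb/state.dat");
pause.initialize();

ScrapeArticles.download
    (receiver, articleURLs, ns.articleGetter, true, null, false, pause, log);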
Hi-Lited Source-Code:
- View Here: Torello/HTML/Tools/NewsSite/NewsSites.java
- Open New Browser-Tab: Torello/HTML/Tools/NewsSite/NewsSites.java
File Size: 36,013 Bytes, Line Count: 748 '\n' Characters Found
-
-
Field Summary
Example of (Extremely-Simple) News Web-Sites: Instantiated Singleton Constants

Modifier and Type    Field
static NewsSite      ABCES
static NewsSite      ElEspectador
static NewsSite      ElNacional
static NewsSite      GovCN
static NewsSite      GovCNCarousel
static NewsSite      Pulso
-
Method Summary
Functional-Interface Lambda-Target Methods (Functions for 'Function-Pointers')

Modifier and Type       Method
static Vector<String>   ABC_LINKS_GETTER(URL url, Vector<HTMLNode> page)
static Vector<String>   EL_ESPECTADOR_LINKS_GETTER(URL url, Vector<HTMLNode> page)
static Vector<String>   EL_NACIONAL_LINKS_GETTER(URL url, Vector<HTMLNode> page)
static Vector<String>   GOVCN_CAROUSEL_LINKS_GETTER(URL url, Vector<HTMLNode> page)

Command Line Invocation Methods

Modifier and Type   Method
static void         main(String[] argv)
static void         runExample()
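Each of the four lambda-target methods above has the same shape: it accepts the section-page URL together with that page's parsed HTML Vector, and returns the Article-Link URL's it finds. Assuming the LinksGet functional interface declares a single abstract method matching that (URL, Vector<HTMLNode>) -> Vector<String> signature, any of them can be handed to a NewsSite definition as an ordinary method reference:

// A "function-pointer" to the ABC.ES link-getter; this assumes LinksGet's abstract
// method matches the (URL url, Vector<HTMLNode> page) -> Vector<String> shape above.
LinksGet abcLinks = NewsSites::ABC_LINKS_GETTER;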
-
-
-
Field Detail
-
ABCES
public static final NewsSite ABCES
This is the NewsSite definition for the Newspaper located at: https://www.abc.es/

Newspaper Name:              ABC España
Country of Origin:           Spain
Website URL:                 https://abc.es
Newspaper Printing Language: Spanish

- Newspaper Article Groups / Sections (Scrape Sections): Retrieved from Data File
- StrFilter (News Web-Site Section-Page Article-Link <A HREF=...> Filter): 'HREF' must end with '.html'
  See: StrFilter.comparitor(TextComparitor, String[])
  See: TextComparitor.EW_CI
- LinksGet (Used to manually retrieve Article-Link URL's): Invokes method ABC_LINKS_GETTER(URL, Vector)
- ArticleGet (Retrieves Article-Body Content from an Article-Link Web-Page): <MAIN>...</MAIN>
  See: ArticleGet.usual(String)
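For reference, a filter like the one described in this table could presumably be built directly from the factory method cited above. This is only a sketch, assuming the String[] parameter of StrFilter.comparitor is the var-args list of compare-String's:

// 'HREF' must end with '.html'; TextComparitor.EW_CI is the "ends-with,
// case-insensitive" comparison cited in the table above.
StrFilter articleLinkFilter = StrFilter.comparitor(TextComparitor.EW_CI, ".html");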
View a copy of the logs that are generated from using this NewsSite instance:
- ABC.ES ScrapeURLs LOG
- ScrapeArticles
IMPORTANT NOTE: Though the ScrapeURLs code will check for duplicate URL's that may be returned within any given section, Article URL's may be repeated among the different sections of the newspaper. Since the URL-scrape returned nearly 3,000 articles, the log of an Article scrape is not included here. Proper duplicate-URL checking code has obviously been written, but would be too complicated to show in this example.
CHANGE: There are no guarantees when scraping HTML from the Internet. If any of the news-providers in this example-class were to modify or update the HTML that serves their news-stories, there is a real chance that the "Getters" and "Filters" in these examples would no longer be valid. It is important to realize, though, that even if the HTML wrappers for the Article Bodies or the Article Links were to change on the source news-site, updating the Links and Article Getters (or the Links Filter) is at most a change of 5 lines of code.
If, at some point, use of this class results in a long stream of messages indicating that no Article URL-Links were identified, or that the Article-Bodies failed to be extracted, simply look at the raw HTML from the site and change the getters or Regular-Expressions accordingly.
NOTE: The logs included in this class' documentation were generated by scrapes in September of 2020.
-
Pulso
public static final NewsSite Pulso
This is the NewsSite definition for the Newspaper located at: https://www.elpulso.mx/

Newspaper Name:              El Pulso, México
Country of Origin:           México
Website URL:                 https://elpulso.mx
Newspaper Printing Language: Spanish

- Newspaper Article Groups / Sections (Scrape Sections): Retrieved from Data File
- StrFilter (News Web-Site Section-Page Article-Link <A HREF=...> Filter): 'HREF' must match:
  http://some.domain/YYYY/MM/DD/<article-name>/
- LinksGet (Used to manually retrieve Article-Link URL's): null. Retrieves all Anchor-Links on a
  Section-Page. Note that URL's must still pass the previous StrFilter (above) in order to be
  parsed as Articles.
- ArticleGet (Retrieves Article-Body Content from an Article-Link Web-Page):
  <DIV CLASS="entry-content">...</DIV>
  See: ArticleGet.usual(TextComparitor, String[])
  See: TextComparitor.C
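The date-based URL-shape above can be checked with an ordinary java.util.regex pattern. The expression below merely illustrates the http://some.domain/YYYY/MM/DD/<article-name>/ form from the table; it is not the exact filter this class uses, and the sample URL is hypothetical.

import java.util.regex.Pattern;

// Illustrative only: four-digit year, two-digit month / day, then the article-name
Pattern p = Pattern.compile("^https?://[^/]+/\\d{4}/\\d{2}/\\d{2}/[^/]+/$");

boolean ok = p.matcher("https://elpulso.mx/2020/09/15/some-article-name/").matches(); // true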
-
ElNacional
public static final NewsSite ElNacional
This is the NewsSite definition for the Newspaper located at: https://www.elnacional.com/

Newspaper Name:              El Nacional
Country of Origin:           Venezuela
Website URL:                 https://elnacional.com
Newspaper Printing Language: Spanish

- Newspaper Article Groups / Sections (Scrape Sections): Retrieved from Data File
- URLFilter (News Web-Site Section-Page Article-Link <A HREF=...> Filter): null. The LinksGet
  provided here will only return valid Article URL's, so there is no need for a URLFilter.
- LinksGet (Used to manually retrieve Article-Link URL's): Invokes method
  EL_NACIONAL_LINKS_GETTER(URL, Vector)
- ArticleGet (Retrieves Article-Body Content from an Article-Link Web-Page):
  <ARTICLE>...</ARTICLE>
  See: ArticleGet.usual(String)
View a copy of the logs that are generated from using this NewsSite.
CHANGE: There are no guarantees when scraping HTML from the Internet. If any of the news-providers in this example-class were to modify or update the HTML that serves their news-stories, there is a real chance that the "Getters" and "Filters" in these examples would no longer be valid. It is important to realize, though, that even if the HTML wrappers for the Article Bodies or the Article Links were to change on the source news-site, updating the Links and Article Getters (or the Links Filter) is at most a change of 5 lines of code.
If, at some point, use of this class results in a long stream of messages indicating that no Article URL-Links were identified, or that the Article-Bodies failed to be extracted, simply look at the raw HTML from the site and change the getters or Regular-Expressions accordingly.
NOTE: The logs included in this class' documentation were generated by scrapes in September of 2020.
-
ElEspectador
public static final NewsSite ElEspectador
This is the NewsSite definition for the Newspaper located at: https://www.elespectador.com/

Newspaper Name:              El Espectador
Country of Origin:           Colombia
Website URL:                 https://elespectador.com
Newspaper Printing Language: Spanish

- Newspaper Article Groups / Sections (Scrape Sections): Retrieved from Data File
- StrFilter (News Web-Site Section-Page Article-Link <A HREF=...> Filter): 'HREF' must end with
  a forward-slash '/' character.
  See: TextComparitor.ENDS_WITH
- LinksGet (Used to manually retrieve Article-Link URL's): Invokes method
  EL_ESPECTADOR_LINKS_GETTER(URL, Vector)
- ArticleGet (Retrieves Article-Body Content from an Article-Link Web-Page):
  <DIV CLASS="l-main">...</DIV>
  See: ArticleGet.usual(TextComparitor, String[])
  See: TextComparitor.C
View a copy of the logs that are generated from using this NewsSite.
CHANGE: There are no guarantees when scraping HTML from the Internet. If any of the news-providers in this example-class were to modify or update the HTML that serves their news-stories, there is a real chance that the "Getters" and "Filters" in these examples would no longer be valid. It is important to realize, though, that even if the HTML wrappers for the Article Bodies or the Article Links were to change on the source news-site, updating the Links and Article Getters (or the Links Filter) is at most a change of 5 lines of code.
If, at some point, use of this class results in a long stream of messages indicating that no Article URL-Links were identified, or that the Article-Bodies failed to be extracted, simply look at the raw HTML from the site and change the getters or Regular-Expressions accordingly.
NOTE: The logs included in this class' documentation were generated by scrapes in September of 2020.
-
GovCNCarousel
public static final NewsSite GovCNCarousel
This is the NewsSite definition for the Newspaper located at: https://www.gov.cn/

The "Carousels" are just the emphasized or "hilited" links that appear on three separate pages. There is also a complete-link NewsSite definition (GovCN, below) that will retrieve all links, not just the links hilited by the carousel.

Newspaper Name:              Chinese Government Web Portal
Country of Origin:           People's Republic of China
Website URL:                 https://gov.cn
Newspaper Printing Language: Mandarin Chinese

- Newspaper Article Groups / Sections (Scrape Sections): Retrieved from Data File
- StrFilter (News Web-Site Section-Page Article-Link <A HREF=...> Filter): 'HREF' must match:
  "^http://www.gov.cn/(?:.+?/)?\\d{4}-\\d{2}/\\d{2}/(?:.+?/)?content_\\d+.htm(?:l)?(#\\d+)?"
- LinksGet (Used to manually retrieve Article-Link URL's): Invokes method
  GOVCN_CAROUSEL_LINKS_GETTER(URL, Vector)
- ArticleGet (Retrieves Article-Body Content from an Article-Link Web-Page):
  <DIV CLASS="article ...">...</DIV>
  See: ArticleGet.usual(TextComparitor, String[])
  See: TextComparitor.C
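The regular expression in the table above can be exercised directly with java.util.regex. The article URL below is hypothetical, but it shows the 'content_<digits>.htm' shape that this filter accepts:

import java.util.regex.Pattern;

// The StrFilter regular-expression, copied from the table above
Pattern govCN = Pattern.compile(
    "^http://www.gov.cn/(?:.+?/)?\\d{4}-\\d{2}/\\d{2}/(?:.+?/)?content_\\d+.htm(?:l)?(#\\d+)?"
);

// Hypothetical article-link, shaped like the URL's this site publishes
boolean ok = govCN
    .matcher("http://www.gov.cn/xinwen/2020-09/15/content_5543535.htm")
    .matches(); // true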
View a copy of the logs that are generated from using this NewsSite.
CHANGE: There are no guarantees when scraping HTML from the Internet. If any of the news-providers in this example-class were to modify or update the HTML that serves their news-stories, there is a real chance that the "Getters" and "Filters" in these examples would no longer be valid. It is important to realize, though, that even if the HTML wrappers for the Article Bodies or the Article Links were to change on the source news-site, updating the Links and Article Getters (or the Links Filter) is at most a change of 5 lines of code.
If, at some point, use of this class results in a long stream of messages indicating that no Article URL-Links were identified, or that the Article-Bodies failed to be extracted, simply look at the raw HTML from the site and change the getters or Regular-Expressions accordingly.
NOTE: The logs included in this class' documentation were generated by scrapes in September of 2020.
-
GovCN
public static final NewsSite GovCN
This is the NewsSite definition for the Newspaper located at: https://www.gov.cn/

This version of the "Gov.CN" website will scour a larger set of section URL's, and will not limit the returned Article-Links to just those found on the Java-Script carousel. The Java-Script Carousel will almost always have a total of five news-article links available, while this 'NewsSite' definition may return up to thirty or forty different articles per news-section.

Newspaper Name:              Chinese Government Web Portal
Country of Origin:           People's Republic of China
Website URL:                 https://gov.cn
Newspaper Printing Language: Mandarin Chinese

- Newspaper Article Groups / Sections (Scrape Sections): Retrieved from Data File
- StrFilter (News Web-Site Section-Page Article-Link <A HREF=...> Filter): 'HREF' must match:
  "^http://www.gov.cn/(?:.+?/)?\\d{4}-\\d{2}/\\d{2}/(?:.+?/)?content_\\d+.htm(?:l)?(#\\d+)?"
- LinksGet (Used to manually retrieve Article-Link URL's): null. Retrieves all Anchor-Links on a
  Section-Page. Note that URL's must still pass the previous StrFilter (above) in order to be
  parsed as Articles.
- ArticleGet (Retrieves Article-Body Content from an Article-Link Web-Page):
  <DIV CLASS="article ...">...</DIV>
  See: ArticleGet.usual(TextComparitor, String[])
  See: TextComparitor.C
View a copy of the logs that are generated from using this NewsSite.
CHANGE: There are no guarantees when scraping HTML from the Internet. If any of the news-providers in this example-class were to modify or update the HTML that serves their news-stories, there is a real chance that the "Getters" and "Filters" in these examples would no longer be valid. It is important to realize, though, that even if the HTML wrappers for the Article Bodies or the Article Links were to change on the source news-site, updating the Links and Article Getters (or the Links Filter) is at most a change of 5 lines of code.
If, at some point, use of this class results in a long stream of messages indicating that no Article URL-Links were identified, or that the Article-Bodies failed to be extracted, simply look at the raw HTML from the site and change the getters or Regular-Expressions accordingly.
NOTE: The logs included in this class' documentation were generated by scrapes in September of 2020.
-
-
Method Detail
-
runExample
public static void runExample() throws java.io.IOException
This example will run the news-site scrape on the Chinese Government News Article Carousel.
IMPORTANT NOTE: This method will create a directory called "cnb" on your file-system, where it will write the contents of (most likely) 15 newspaper articles to disk as HTML files. The output log generated by this method may be viewed here: Gov.CN.log.html
- Throws:
java.io.IOException
- This throws for IO errors that may occur when reading the web-server, or when saving the web-pages or images to the file-system.
- See Also:
FileRW.delTree(String, boolean, Appendable), NewsSite, FileRW.writeFile(CharSequence, String), C.toHTML(String, boolean, boolean, boolean)
- Code:
- Exact Method Body:
StorageWriter log = new StorageWriter();

// This directory will contain ".dat" files that are simply "Serialized" HTML Vectors.
// Each ".dat" file will contain precisely one HTML page.
final String dataFilesDir = "cnb" + File.separator + "articleData" + File.separator;

// This directory will contain sub-directories with ".html" files (and image-files)
// for each news-article that is saved / downloaded.
final String htmlFilesDir = "cnb" + File.separator + "articleHTML" + File.separator;

// This CLEARS WHATEVER DATA IS CURRENTLY IN THE DIRECTORY (by deleting all its contents)
// The following code is the same as the UNIX Shell Command:
// rm -r cnb/articleData/
// mkdir cnb/articleData
FileRW.delTree(dataFilesDir, true, log);

// The following code is the same as the UNIX Shell Command:
// rm -r cnb/articleHTML/
// mkdir cnb/articleHTML
FileRW.delTree(htmlFilesDir, true, log);

// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
// Previous Download Data Erased (if any), Start today's News-Site Scrape
// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***

// Use the "GovCNCarousel" instance that is created in this class as a NewsSite
NewsSite ns = NewsSites.GovCNCarousel;

// Call the "ScrapeURLs" class to retrieve all of the available newspaper articles
// on the Java-Script "Article Carousel". Again, the "Article Carousel" is just this
// little widget at the top of the page that rotates (usually) five hilited / emphasized
// news-article links for today.
Vector<Vector<String>> articleURLs = ScrapeURLs.get(ns, log);

// This is usually not very important if only a small number of articles are being
// scraped. When downloading hundreds of articles - being able to pause if there is a
// web-site IOError (And restart) is very important.
//
// The standard factory-generated "getFSInstance" creates a small file on the file-system
// for saving the "Download State" while downloading...
Pause pause = Pause.getFSInstance("cnb" + File.separator + "state.dat");
pause.initialize();

// The "Scraped Articles" will be sent to the directory named by "dataFilesDir".
// Using the File-System to save these articles is the default-factory means for
// saving article-data. Writing a customized "ScrapedArticleReceiver" to do anything
// from saving article-data to a Data-Base up to and including e-mailing article data
// is possible using a self-written "ScrapedArticleReceiver".
ScrapedArticleReceiver receiver = ScrapedArticleReceiver.saveToFS(dataFilesDir);

// This will download each of the articles from their web-page URL. The web-page
// article URL's were retrieved by "ScrapeURLs". The saved HTML (as HTML Vectors)
// is sent to the "Article Receiver" (defined in the previous step). These news articles
// are saved as ".dat" since they are serialized java-objects.
//
// Explaining some "unnamed parameters" passed to the method invocation below:
//
// true:  [skipArticlesWithoutPhotos] Skips Mandarin Chinese Newspaper Articles that do not
//        include at least one photo. Photos usually help when reading foreign news articles.
// null:  [bannerAndAdFinder] Some sites include images for Facebook links or advertising.
//        Gov.CN usually doesn't have these, but occasionally there are extraneous links.
//        For the purposes of this example, this parameter is ignored, and passed null.
// false: [keepOriginalPageHTML] The "Complete Page" content before the Article Body is
//        extracted from the Article Web-Page is not saved. This can occasionally be useful
//        if the HTML <HEAD>...</HEAD> has JSON or React-JS data to extract.
ScrapeArticles.download
    (receiver, articleURLs, ns.articleGetter, true, null, false, pause, log);

// Now this will convert each of the ".dat" files to an ".html" file - and also it
// will download the pictures / images included in the article.
//
// Explaining some "unnamed parameters" passed to the method invocation below:
//
// true: [cleanIt] This runs some basic HTML remove operations. The best way to see
//       what the parameter "cleanIt" asks to have removed is to view the class "ToHTML".
// null: [HTMLModifier] Cleaning up other extraneous links and content in a newspaper
//       article body like advertising or links to other articles is usually necessary.
//       Anywhere between 1 and 10 lines of NodeSearch Removal Operations will get rid of
//       unnecessary HTML. For the purposes of this example, such a cleaning operation is
//       not done here - although the final articles do include some "links to other
//       articles" that are not "CLEANED" like they should be.
ToHTML.convert(dataFilesDir, htmlFilesDir, true, null, log);

// NOTE: The log of running this command on Debian UNIX / LINUX may be viewed in the
// JavaDoc Comments in the top of this method. If this method is run in an MS-DOS
// or Windows Environment, there will be no screen colors available to view.
FileRW.writeFile(
    C.toHTML(log.getString(), true, true, true),
    "cnb" + File.separator + "Gov.CN.log.html"
);
-
main
public static void main(java.lang.String[] argv) throws java.io.IOException
Prints the contents of the Data File. Invoking this command allows a programmer to see which "sub-sections" are ascribed to each of the different newspaper definitions in this class. Each "sub-section" is nothing more than a URL-branch of the primary web-site URL.
HTML Elements:

<!-- If the following were the primary news-site -->
http://news.baidu.com

<!-- This would be a "sub-section" of the primary site -->
http://news.baidu.com/sports
Can be called from the command line.
If a single command-line argument is passed to "argv[0]", the contents of the "Sections URL Data File" will be output to a text-file that is named using the String passed to "argv[0]".
.- Parameters:
argv
- These are the command line arguments passed by the JRE to this method.
- Throws:
java.io.IOException
- If there are any problems while attempting to save the output to the output file (if one was named / requested).
- Code:
- Exact Method Body:
// Uncomment this line to run the example code (instead of section-data print)
// runExample(); System.exit(0);

// The data-file is loaded into private field "newsPaperSections".
// This private field is a Hashtable<String, Vector<URL>>. Convert each of
// these sections so that they may be printed to terminal, and maybe to a text
// file.
StringBuilder sb = new StringBuilder();

for (String newspaper : newsPaperSections.keySet())
{
    sb.append(newspaper + '\n');

    for (URL section : newsPaperSections.get(newspaper))
        sb.append(section.toString() + '\n');

    sb.append("\n\n***************************************************\n\n");
}

String s = sb.toString();
System.out.println(s);

// If there is a command-line parameter, it shall be interpreted as a file-name.
// The contents of the "sections data-file" (as text) will be written to a file on the
// file-system using the String-value of "argv[0]" as the name of the output-filename.
if (argv.length == 1) FileRW.writeFile(s, argv[0]);
-
ABC_LINKS_GETTER
public static java.util.Vector<java.lang.String> ABC_LINKS_GETTER (java.net.URL url, java.util.Vector<HTMLNode> page)
The News Site at address "https://www.abc.es/" is slightly more complicated when retrieving News-Article Links.
Notice that each newspaper-article URL-link is "wrapped" in an HTML '<ARTICLE>...</ARTICLE>' Element.
If this code were translated into an "XPath Query" or "CSS Selector", it would read: article a. Specifically, it says to find all 'Anchor' elements that are descendants of 'Article' Elements.
- See Also:
TagNodeFindL1Inclusive.all(Vector, String), TagNodeGet.first(Vector, int, int, TC, String[]), TagNode.AV(String)
- Code:
- Exact Method Body:
Vector<String> ret = new Vector<>();
TagNode        tn;
String         urlStr;

// Links are kept inside <ARTICLE> ... </ARTICLE> on the main / section page.
for (DotPair article : TagNodeFindL1Inclusive.all(page, "article"))

    // Now find the <A HREF=...> ... </A>
    if ((tn = TagNodeGet.first(page, article.start, article.end, TC.OpeningTags, "a")) != null)

        if ((urlStr = tn.AV("href")) != null)
            ret.add(urlStr);

return ret;
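For illustration, this getter could also be invoked by hand on a downloaded section-page. The section URL below is hypothetical, and HTMLPage.getPageTokens(URL, boolean) is assumed to be this library's standard page-tokenizer:

// Hypothetical ABC.ES section-page
URL sectionURL = new URL("https://www.abc.es/economia/");

// Parse the section-page into the HTML Vector representation this library uses
Vector<HTMLNode> page = HTMLPage.getPageTokens(sectionURL, false);

// Extract the Article-Links wrapped in <ARTICLE> ... </ARTICLE> Elements
Vector<String> articleLinks = NewsSites.ABC_LINKS_GETTER(sectionURL, page);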
-
EL_NACIONAL_LINKS_GETTER
public static java.util.Vector<java.lang.String> EL_NACIONAL_LINKS_GETTER (java.net.URL url, java.util.Vector<HTMLNode> page)
The News Site at address "https://www.ElNacional.com/" is slightly more complicated when retrieving News-Article Links.
Notice that each newspaper-article URL-link is "wrapped" in an HTML '<DIV CLASS="td-module-thumb">...</DIV>' Element.
If this code were translated into an "XPath Query" or "CSS Selector", it would read: div.td-module-thumb a. Specifically, it says to find all 'Anchor' elements that are descendants of 'DIV' Elements where said Divider's CSS CLASS contains 'td-module-thumb'.
- See Also:
InnerTagFindInclusive.all(Vector, String, String, TextComparitor, String[]), TagNodeGet.first(Vector, int, int, TC, String[]), TagNode.AV(String)
- Code:
- Exact Method Body:
Vector<String> ret = new Vector<>();
TagNode        tn;
String         urlStr;

// Links are kept inside <DIV CLASS=td-module-thumb> ... </DIV> on the main / section page.
for (DotPair article : InnerTagFindInclusive.all
    (page, "div", "class", TextComparitor.C, "td-module-thumb"))

    // Now find the <A HREF=...> ... </A>
    if ((tn = TagNodeGet.first
        (page, article.start, article.end, TC.OpeningTags, "a")) != null)

        if ((urlStr = tn.AV("href")) != null)
            ret.add(urlStr);

return ret;
-
EL_ESPECTADOR_LINKS_GETTER
public static java.util.Vector<java.lang.String> EL_ESPECTADOR_LINKS_GETTER (java.net.URL url, java.util.Vector<HTMLNode> page)
The News Site at address "https://www.ElEspectador.com/" is slightly more complicated when retrieving News-Article Links.
Notice that each newspaper-article URL-link is "wrapped" in an HTML '<DIV CLASS="Card ...">...</DIV>' Element.
If this code were translated into an "XPath Query" or "CSS Selector", it would read: div.Card a.card-link. Specifically, it says to find all 'Anchor' elements whose CSS Class contains 'card-link' and which are descendants of 'DIV' Elements where said Divider's CSS CLASS contains 'Card'.
- See Also:
InnerTagFindInclusive.all(Vector, String, String, TextComparitor, String[]), InnerTagGet.first(Vector, int, int, String, String, TextComparitor, String[]), TagNode.AV(String)
- Code:
- Exact Method Body:
Vector<String> ret = new Vector<>();
TagNode        tn;
String         urlStr;

// Links are kept inside <DIV CLASS="Card ..."> ... </DIV> on the main / section page.
for (DotPair article : InnerTagFindInclusive.all
    (page, "div", "class", TextComparitor.C, "Card"))

    // Now find the <A CLASS="card-link" HREF=...> ... </A>
    if ((tn = InnerTagGet.first
        (page, article.start, article.end, "a", "class", TextComparitor.C, "card-link")) != null)

        if ((urlStr = tn.AV("href")) != null)
            ret.add(urlStr);

return ret;
-
GOVCN_CAROUSEL_LINKS_GETTER
public static java.util.Vector<java.lang.String> GOVCN_CAROUSEL_LINKS_GETTER (java.net.URL url, java.util.Vector<HTMLNode> page)
The News Site at address "https://www.gov.cn/" has a Java-Script "Links Carousel". Essentially, there is a section with "Showcased News Articles" that is intended to emphasize anywhere between four and eight primary articles.
This Links-Carousel is wrapped in an HTML Divider Element as below: <DIV CLASS="slider-carousel">
If this code were translated into an "XPath Query" or "CSS Selector", it would read: div[class=slider-carousel] a. Specifically, it says to find all 'Anchor' elements that are descendants of '<DIV CLASS="slider-carousel">' Elements.
- See Also:
InnerTagGetInclusive.first(Vector, String, String, TextComparitor, String[]), TagNodeGet.all(Vector, TC, String[]), TagNode.AV(String)
- Code:
- Exact Method Body:
Vector<String> ret = new Vector<>();
String         urlStr;

// Find the first <DIV CLASS="slider-carousel"> ... </DIV> section
Vector<HTMLNode> carouselDIV = InnerTagGetInclusive.first
    (page, "div", "class", TextComparitor.CN_CI, "slider-carousel");

// Retrieve any HTML Anchor <A HREF=...> ... </A> found within the contents of the
// Divider.
for (TagNode tn : TagNodeGet.all(carouselDIV, TC.OpeningTags, "a"))
    if ((urlStr = tn.AV("href")) != null)
        ret.add(urlStr);

return ret;
-
-