java.lang.Object
- Torello.HTML.Tools.NewsSite.NewsSites

```
public class NewsSites
extends java.lang.Object
```
This class is nothing more than an 'Example Class' that contains some foreign-language based news web-pages, from both overseas and from Latin America.

This class provides five example News Websites with all of the necessary configurations that would be passed to ScrapeURLs, and (subsequently) ScrapeArticles.

The following news-oriented web-sites are provided in this "example" (of sorts) class.
Side Note: Scraping major Associated Press news-sites such as Fox-News, CNN, MSNBC, and Yahoo! News is not a problem for this software - although taking both spiritual and moral stances against the terror that these organizations have caused the world is largely the driving force behind wanting to scrape foreign news sites.
Hi-Lited Source-Code:
- View Here: Torello/HTML/Tools/NewsSite/NewsSites.java
- Open New Browser-Tab: Torello/HTML/Tools/NewsSite/NewsSites.java
File Size: 12,829 Bytes Line Count: 340 '\n' Characters Found

Field Summary

Example of (Extremely-Simple) News Web-Sites: Instantiated Singleton Constants

Modifier and Type	Field
`static NewsSite`	`ABCES`
`static NewsSite`	`ElEspectador`
`static NewsSite`	`ElNacional`
`static NewsSite`	`GovCN`
`static NewsSite`	`GovCNCarousel`
`static NewsSite`	`Pulso`

Method Summary

Functional-Interface Lambda-Target Methods (Functions for 'Function-Pointers')

Modifier and Type	Method
`static Vector<String>`	`ABC_LINKS_GETTER(URL url, Vector<HTMLNode> page)`
`static Vector<String>`	`EL_ESPECTADOR_LINKS_GETTER(URL url, Vector<HTMLNode> page)`
`static Vector<String>`	`EL_NACIONAL_LINKS_GETTER(URL url, Vector<HTMLNode> page)`
`static Vector<String>`	`GOVCN_CAROUSEL_LINKS_GETTER(URL url, Vector<HTMLNode> page)`
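The methods in this table are meant to be passed around as method references. The following is a minimal sketch of that idea in plain Java, not the library's own code: the `LinksGetter` interface, the `EXAMPLE_LINKS_GETTER` method, and the simplification of a page to a `Vector<String>` of raw `href` values are all assumptions made for illustration.

```java
import java.net.URI;
import java.net.URL;
import java.util.List;
import java.util.Vector;

final class LinksGetterSketch
{
    // Hypothetical stand-in for a links-getter functional interface.
    // The real methods receive a Vector<HTMLNode>; a Vector<String> of raw
    // href values is used here so the sketch has no external dependencies.
    @FunctionalInterface
    interface LinksGetter
    {
        Vector<String> apply(URL url, Vector<String> page);
    }

    // Same shape as the *_LINKS_GETTER methods: (URL, page) -> Vector<String>.
    // This hypothetical getter keeps only the hrefs that end with ".html".
    static Vector<String> EXAMPLE_LINKS_GETTER(URL url, Vector<String> page)
    {
        Vector<String> ret = new Vector<>();

        for (String href : page)
            if (href.endsWith(".html"))
                ret.add(href);

        return ret;
    }

    public static void main(String[] args) throws Exception
    {
        // A static method reference is assigned to the functional interface,
        // which is what "function pointer" means in this context.
        LinksGetter getter = LinksGetterSketch::EXAMPLE_LINKS_GETTER;

        URL section = URI.create("https://example.com/section/").toURL();
        Vector<String> page =
            new Vector<>(List.of("/story-1.html", "/about", "/story-2.html"));

        System.out.println(getter.apply(section, page));
    }
}
```

Because the getter is just a value, a site definition can hold one getter per newspaper and swap it out when the site's HTML changes.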

Command Line Invocation Methods
Modifier and Type	Method
`static void`	`main(String[] argv)`
`static void`	`runExample()`

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

ABCES


public static final NewsSite ABCES

This is the NewsSite definition for the Newspaper located at: https://www.abc.es/.

Parameter	Significance
Newspaper Name	ABC España
Country of Origin	Spain
Website URL	`https://abc.es`
Newspaper Printing Language	Spanish

Parameter	Purpose	Value
Newspaper Article Groups / Sections	Scrape Sections	Retrieved from Data File
`StrFilter`	News Web-Site Section-Page Article-Link (`<A HREF=...>`) Filter	`'HREF'` must end with `'.html'` See: `StrFilter.comparitor(TextComparitor, String[])` See: `TextComparitor.EW_CI`
`LinksGet`	Used to *manually* retrieve Article-Link `URL's`	Invokes method `ABC_LINKS_GETTER(URL, Vector)`
`ArticleGet`	Retrieves Article-Body Content from an Article-Link Web-Page	`<MAIN>...</MAIN>` See: `ArticleGet.usual(String)`
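The "must end with `.html`" rule can be pictured with a plain `java.util.function.Predicate`. This is only a sketch of the filtering idea under the assumption that the comparison is case-insensitive; it does not use `StrFilter` or `TextComparitor`, and the class and method names are hypothetical.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

final class EndsWithFilterSketch
{
    // Builds a case-insensitive "ends with" test for a single suffix.
    // regionMatches returns false when the computed offset is negative,
    // so strings shorter than the suffix are rejected.
    static Predicate<String> endsWithIgnoreCase(String suffix)
    {
        return s -> s != null && s.regionMatches
            (true, s.length() - suffix.length(), suffix, 0, suffix.length());
    }

    // Keeps only the hrefs that pass the filter, preserving their order.
    static List<String> keep(List<String> hrefs, Predicate<String> filter)
    {
        return hrefs.stream().filter(filter).collect(Collectors.toList());
    }

    public static void main(String[] args)
    {
        // Hypothetical hrefs, for illustration only
        List<String> hrefs = List.of(
            "https://www.abc.es/a/story.html",
            "https://www.abc.es/b/STORY.HTML",
            "https://www.abc.es/c/");

        System.out.println(keep(hrefs, endsWithIgnoreCase(".html")));
    }
}
```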

View a copy of the logs that are generated from using this NewsSite instance.

ABC.ES ScrapeURLs LOG
ScrapeArticles
IMPORTANT NOTE: Although ScrapeURLs checks for duplicate URL's returned within any given section, Article URL's may still be repeated across the different sections of the newspaper. Since the URL-scrape returned nearly 3,000 articles, the log of an Article scrape is not included here. Proper duplicate-URL checking code has been written, but it is too complicated to show in this example.
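As a rough illustration of what cross-section duplicate checking involves, the sketch below tracks every URL already seen and drops repeats from later sections. It is a simplified stand-in, not the duplicate-checking code referred to above; the class name, method name, and sample URLs are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class CrossSectionDedupSketch
{
    // Returns the same sections, in the same order, with any URL that
    // already appeared (in this section or an earlier one) removed.
    static List<List<String>> dropRepeats(List<List<String>> sections)
    {
        Set<String> seen = new HashSet<>();
        List<List<String>> out = new ArrayList<>();

        for (List<String> section : sections)
        {
            List<String> kept = new ArrayList<>();

            // Set.add returns false when the URL was already present
            for (String url : section)
                if (seen.add(url))
                    kept.add(url);

            out.add(kept);
        }

        return out;
    }

    public static void main(String[] args)
    {
        List<List<String>> sections = List.of(
            List.of("https://example.com/a.html", "https://example.com/b.html"),
            List.of("https://example.com/b.html", "https://example.com/c.html"));

        System.out.println(dropRepeats(sections));
    }
}
```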

Change: There are no guarantees when scraping HTML from the Internet. If any of the news-providers in this example-class were to modify or update the HTML that serves their news-stories, there is a real chance that the "Getters" and "Filters" in these examples would no longer be valid. It is important to realize, though, that even if the HTML wrappers for the Article Bodies or Article Links change on the source news-site, updating the Links and Article Getters (or the Links Filter) is at most a change of 5 lines of code.

If at some point, use of this class results in a long stream of messages indicating that no Article URL-Links were identified, or that the Article-Bodies failed to be extracted, simply look at the raw-HTML from the site and change the getters or Regular-Expressions accordingly.

Note: The logs included in this class' documentation were generated by scrapes in September of 2020.

Pulso


public static final NewsSite Pulso

This is the NewsSite definition for the Newspaper located at: https://www.elpulso.mx/.

Parameter	Significance
Newspaper Name	El Pulso, México
Country of Origin	México
Website URL	`https://elpulso.mx`
Newspaper Printing Language	Spanish

Parameter	Purpose	Value
Newspaper Article Groups / Sections	Scrape Sections	Retrieved from Data File
`StrFilter`	News Web-Site Section-Page Article-Link (`<A HREF=...>`) Filter	`HREF` must match: `http://some.domain/YYYY/MM/DD/<article-name>/`
`LinksGet`	Used to *manually* retrieve Article-Link `URL's`	`null`. Retrieves *all* Anchor-Links on a Section-Page. Note that `URL's` must still pass the previous `StrFilter` (above) in order to be parsed as `Article`'s.
`ArticleGet`	Retrieves Article-Body Content from an Article-Link Web-Page	`<DIV CLASS="entry-content">...</DIV>` See: `ArticleGet.usual(TextComparitor, String[])` See: `TextComparitor.C`
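The `http://some.domain/YYYY/MM/DD/<article-name>/` shape can be expressed as a regular expression. The pattern below is an assumption written for illustration (the actual expression used by this `NewsSite` is not shown on this page), and the class name and sample URLs are hypothetical.

```java
import java.util.regex.Pattern;

final class DatePathFilterSketch
{
    // Assumed pattern: scheme, host, /YYYY/MM/DD/, one path segment, final slash
    private static final Pattern DATE_PATH =
        Pattern.compile("^https?://[^/]+/\\d{4}/\\d{2}/\\d{2}/[^/]+/$");

    // True when the whole href has the dated article shape
    static boolean looksLikeArticle(String href)
    {
        return href != null && DATE_PATH.matcher(href).matches();
    }

    public static void main(String[] args)
    {
        // Hypothetical hrefs, for illustration only
        System.out.println(looksLikeArticle("http://some.domain/2020/09/15/article-name/"));
        System.out.println(looksLikeArticle("http://some.domain/category/deportes/"));
    }
}
```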

ElNacional


public static final NewsSite ElNacional

This is the NewsSite definition for the Newspaper located at: https://www.elnacional.com/.

Parameter	Significance
Newspaper Name	El Nacional
Country of Origin	Venezuela
Website URL	`https://elnacional.com`
Newspaper Printing Language	Spanish

Parameter	Purpose	Value
Newspaper Article Groups / Sections	Scrape Sections	Retrieved from Data File
`URLFilter`	News Web-Site Section-Page Article-Link (`<A HREF=...>`) Filter	`null`. The `LinksGet` provided here will only return valid `Article URL's`, so there is no need for a `URLFilter`.
`LinksGet`	Used to *manually* retrieve Article-Link `URL's`	Invokes method `EL_NACIONAL_LINKS_GETTER(URL, Vector)`
`ArticleGet`	Retrieves Article-Body Content from an Article-Link Web-Page	`<ARTICLE>...</ARTICLE>` See: `ArticleGet.usual(String)`

View a copy of the logs that are generated from using this NewsSite.


ElEspectador


public static final NewsSite ElEspectador

This is the NewsSite definition for the Newspaper located at: https://www.elespectador.com/.

Parameter	Significance
Newspaper Name	El Espectador
Country of Origin	Colombia
Website URL	`https://elespectador.com`
Newspaper Printing Language	Spanish

Parameter	Purpose	Value
Newspaper Article Groups / Sections	Scrape Sections	Retrieved from Data File
`StrFilter`	News Web-Site Section-Page Article-Link (`<A HREF=...>`) Filter	`HREF` must end with a forward-slash `'/'` character. See: `TextComparitor.ENDS_WITH`
`LinksGet`	Used to *manually* retrieve Article-Link `URL's`	Invokes method `EL_ESPECTADOR_LINKS_GETTER`
`ArticleGet`	Retrieves Article-Body Content from an Article-Link Web-Page	`<DIV CLASS="l-main">...</DIV>` See: `ArticleGet.usual(TextComparitor, String[])` See: `TextComparitor.C`

View a copy of the logs that are generated from using this NewsSite.


GovCNCarousel


public static final NewsSite GovCNCarousel

This is the NewsSite definition for the Newspaper located at: https://www.gov.cn/.

The "Carousels" are just the emphasized or "highlighted" links found on three separate pages. A separate NewsSite definition, GovCN, retrieves all links, not just the links highlighted by the carousel.

Parameter	Significance
Newspaper Name	Chinese Government Web Portal
Country of Origin	People's Republic of China
Website URL	`https://gov.cn`
Newspaper Printing Language	Mandarin Chinese

Parameter	Purpose	Value
Newspaper Article Groups / Sections	Scrape Sections	Retrieved from Data File
`StrFilter`	News Web-Site Section-Page Article-Link (`<A HREF=...>`) Filter	`HREF` must match: `"^http://www.gov.cn/(?:.+?/)?\\d{4}-\\d{2}/\\d{2}/(?:.+?/)?content_\\d+.htm(?:l)?(#\\d+)?"`
`LinksGet`	Used to *manually* retrieve Article-Link `URL's`	Invokes method `GOVCN_CAROUSEL_LINKS_GETTER(URL, Vector)`
`ArticleGet`	Retrieves Article-Body Content from an Article-Link Web-Page	`<DIV CLASS="article ...">...</DIV>` See: `ArticleGet.usual(TextComparitor, String[])` See: `TextComparitor.C`
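To see what the `HREF` pattern accepts, it can be tried against a few sample links with `java.util.regex`. This sketch assumes the pattern is the Java string literal written in the code below (with the `\\d{4}` quantifier closed) and that a match is tested with `find()`; the sample URLs and the class name are hypothetical.

```java
import java.util.regex.Pattern;

final class GovCnFilterSketch
{
    // Assumed reading of the filter's regular expression, as a Java literal
    private static final Pattern ARTICLE_URL = Pattern.compile(
        "^http://www.gov.cn/(?:.+?/)?\\d{4}-\\d{2}/\\d{2}/(?:.+?/)?content_\\d+.htm(?:l)?(#\\d+)?");

    // True when the pattern is found, anchored at the start of the href
    static boolean passes(String href)
    {
        return href != null && ARTICLE_URL.matcher(href).find();
    }

    public static void main(String[] args)
    {
        // Hypothetical hrefs, for illustration only
        String[] hrefs = {
            "http://www.gov.cn/xinwen/2020-09/15/content_5543500.htm",
            "http://www.gov.cn/xinwen/index.htm",
            "https://www.gov.cn/xinwen/2020-09/15/content_5543500.htm"
        };

        for (String href : hrefs)
            System.out.println(passes(href) + "  " + href);
    }
}
```

Note that the pattern, as written, begins with `http://`, so an otherwise identical `https://` link does not pass.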

View a copy of the logs that are generated from using this NewsSite.


GovCN


public static final NewsSite GovCN

This is the NewsSite definition for the Newspaper located at: https://www.gov.cn/.

This version of the "Gov.CN" website will scour a larger set of section URL's, and will not limit the returned Article-Links to just those found on the JavaScript carousel. The JavaScript carousel will almost always have a total of five news-article links available. This definition of 'NewsSite' may return as many as thirty to forty different articles per news-section.

Parameter	Significance
Newspaper Name	Chinese Government Web Portal
Country of Origin	People's Republic of China
Website URL	`https://gov.cn`
Newspaper Printing Language	Mandarin Chinese

Parameter	Purpose	Value
Newspaper Article Groups / Sections	Scrape Sections	Retrieved from Data File
`StrFilter`	News Web-Site Section-Page Article-Link (`<A HREF=...>`) Filter	`HREF` must match: `"^http://www.gov.cn/(?:.+?/)?\\d{4}-\\d{2}/\\d{2}/(?:.+?/)?content_\\d+.htm(?:l)?(#\\d+)?"`
`LinksGet`	Used to *manually* retrieve Article-Link `URL's`	`null`. Retrieves *all* Anchor-Links on a Section-Page. Note that `URL's` must still pass the previous `StrFilter` (above) in order to be parsed as `Article`'s.
`ArticleGet`	Retrieves Article-Body Content from an Article-Link Web-Page	`<DIV CLASS="article ...">...</DIV>` See: `ArticleGet.usual(TextComparitor, String[])` See: `TextComparitor.C`

View a copy of the logs that are generated from using this NewsSite.


Method Detail

runExample

```
public static void runExample()
                       throws java.io.IOException
```
This example will run the news-site scrape on the Chinese Government News Article Carousel.

Important Note: This method will create a directory called "cnb" on your file-system, where it will write the contents of (most likely) 15 newspaper articles to disk as HTML files.

The output log generated by this method may be viewed here:

Gov.CN.log.html
Throws:

java.io.IOException - This throws for IO errors that may occur when reading the web-server, or when saving the web-pages or images to the file-system.

See Also:

FileRW.delTree(String, boolean, Appendable), NewsSite, FileRW.writeFile(CharSequence, String), C.toHTML(String, boolean, boolean, boolean)

Code:
Exact Method Body:

 // Click on the @LinkJavaSource Curved-Arrow to view the example code in full screen
 RunExample.run();

main


public static void main(java.lang.String[] argv)
                 throws java.io.IOException

Prints the contents of the Data File. Invoking this command allows a programmer to see which "sub-sections" are ascribed to each of the different news-paper definitions in this class. Each "sub-section" is nothing more than a URL-branch of the primary web site URL.

HTML Elements:

<!-- If the following were the primary news-site -->
http://news.baidu.com

<!-- This would be a "sub-section" of the primary site -->
http://news.baidu.com/sports

Can be called from the command line.

If a single command-line argument is passed to "argv[0]", the contents of the "Sections URL Data File" will be output to a text-file that is named using the String passed to "argv[0]".

Parameters:

argv - These are the command line arguments passed by the JRE to this method.

Throws:

java.io.IOException - If there are any problems while attempting to save the output to the output file (if one was named / requested).

Code:

Exact Method Body:

 // Uncomment this line to run the example code (instead of section-data print)
 // runExample(); System.exit(0);

 // The data-file is loaded into private field "newsPaperSections"
 // This private field is a Hashtable<String, Vector<URL>>.  Convert each of
 // these sections so that they may be printed to terminal and maybe to a text
 // file.

 StringBuilder sb = new StringBuilder();

 for (String newspaper : newsPaperSections.keySet())
 {
     sb.append(newspaper + '\n');
     for (URL section : newsPaperSections.get(newspaper))
         sb.append(section.toString() + '\n');
     sb.append("\n\n***************************************************\n\n");
 }
        
 String s = sb.toString();
 System.out.println(s);
        
 // If there is a command-line parameter, it shall be interpreted as a file-name.
 // The contents of the "sections data-file" (as text) will be written to a file on the
 // file-system using the String-value of "argv[0]" as the name of the output file.

 if (argv.length == 1) FileRW.writeFile(s, argv[0]);

ABC_LINKS_GETTER


public static java.util.Vector<java.lang.String> ABC_LINKS_GETTER
            (java.net.URL url,
             java.util.Vector<HTMLNode> page)

The News Site at address: "https://www.abc.es/" is slightly more complicated when retrieving News-Article Links.

Notice that each newspaper article URL-link is "wrapped" in an HTML '<ARTICLE>...</ARTICLE>' Element.

If this code were translated into an "XPath Query" or "CSS Selector", it would read: `article a`. Specifically, it says to find 'Anchor' elements that are descendants of 'Article' Elements. Note that the method body keeps only the first anchor found inside each 'Article' Element.

Code:

Exact Method Body:

 final Vector<String> ret = new Vector<>();

 TagNode tn;
 String urlStr;

 // Links are kept inside <ARTICLE> ... </ARTICLE> on the main / section page.
 for (DotPair article : TagNodeFindL1Inclusive.all(page, "article"))

     // Now find the <A HREF=...> ... </A>
     if ((tn = TagNodeGet.first(page, article.start, article.end, TC.OpeningTags, "a"))
         != null)

         if ((urlStr = tn.AV("href")) != null)
             ret.add(urlStr);

 return ret;
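For readers without the library at hand, the same two-step lookup (find each `<ARTICLE>` block, then take its first link) can be imitated with `java.util.regex` on a raw HTML string. This is a rough stand-in under stated assumptions, not how the method above works internally: regular expressions are not an HTML parser, only double-quoted `href` values are recognized, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ArticleAnchorSketch
{
    // Captures the content between <article ...> and the nearest </article>
    private static final Pattern ARTICLE = Pattern.compile(
        "<article\\b[^>]*>(.*?)</article>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Captures the value of a double-quoted href inside an opening <a ...> tag
    private static final Pattern ANCHOR_HREF = Pattern.compile(
        "<a\\b[^>]*?\\shref\\s*=\\s*\"([^\"]*)\"",
        Pattern.CASE_INSENSITIVE);

    // For each article block, adds the href of the first anchor that has one
    static List<String> firstLinkPerArticle(String html)
    {
        List<String> ret = new ArrayList<>();
        Matcher article = ARTICLE.matcher(html);

        while (article.find())
        {
            Matcher anchor = ANCHOR_HREF.matcher(article.group(1));
            if (anchor.find()) ret.add(anchor.group(1));
        }

        return ret;
    }

    public static void main(String[] args)
    {
        // Hypothetical section-page fragment, for illustration only
        String html =
            "<article><h2>x</h2><a href=\"/one.html\">1</a></article>" +
            "<div><a href=\"/nav.html\">nav</a></div>" +
            "<ARTICLE class=\"c\"><a class=\"k\" href=\"/two.html\">2</a></ARTICLE>";

        System.out.println(firstLinkPerArticle(html));
    }
}
```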

EL_NACIONAL_LINKS_GETTER


public static java.util.Vector<java.lang.String> EL_NACIONAL_LINKS_GETTER
            (java.net.URL url,
             java.util.Vector<HTMLNode> page)

The News Site at address: "https://www.ElNacional.com/" is slightly more complicated when retrieving News-Article Links.

Notice that each newspaper article URL-link is "wrapped" in an HTML '<DIV CLASS="td-module-thumb">...</DIV>' Element.

If this code were translated into an "XPath Query" or "CSS Selector", it would read: `div.td-module-thumb a`. Specifically, it says to find 'Anchor' elements that are descendants of 'DIV' Elements where said Divider's CSS CLASS contains 'td-module-thumb'. Note that the method body keeps only the first anchor found inside each such Divider.

Code:

Exact Method Body:

 Vector<String> ret = new Vector<>();

 TagNode tn;
 String urlStr;

 // Links are kept inside <DIV CLASS=td-module-thumb> ... </DIV> on the main / section page.
 for (DotPair article : InnerTagFindInclusive.all
     (page, "div", "class", TextComparitor.C, "td-module-thumb"))

     // Now find the <A HREF=...> ... </A>
     if ((tn = TagNodeGet.first
         (page, article.start, article.end, TC.OpeningTags, "a")) != null)

         if ((urlStr = tn.AV("href")) != null)
             ret.add(urlStr);

 return ret;

EL_ESPECTADOR_LINKS_GETTER


public static java.util.Vector<java.lang.String> EL_ESPECTADOR_LINKS_GETTER
            (java.net.URL url,
             java.util.Vector<HTMLNode> page)

The News Site at address: "https://www.ElEspectador.com/" is slightly more complicated when retrieving News-Article Links.

Notice that each newspaper article URL-link is "wrapped" in an HTML '<DIV CLASS="Card ...">...</DIV>' Element.

If this code were translated into an "XPath Query" or "CSS Selector", it would read: `div.Card a.card-link`. Specifically, it says to find 'Anchor' elements whose CSS Class contains 'card-link' and which are descendants of 'DIV' Elements where said Divider's CSS CLASS contains 'Card'. Note that the method body keeps only the first such anchor found inside each 'Card' Divider.

Code:

Exact Method Body:

 Vector<String> ret = new Vector<>();

 TagNode tn;
 String  urlStr;

 // Links are kept inside <DIV CLASS="Card ..."> ... </DIV> on the main / section page.
 for (DotPair article : InnerTagFindInclusive.all
     (page, "div", "class", TextComparitor.C, "Card"))

     // Now find the <A CLASS="card-link" HREF=...> ... </A>
     if ((tn = InnerTagGet.first
         (page, article.start, article.end, "a", "class", TextComparitor.C, "card-link"))
             != null)

         if ((urlStr = tn.AV("href")) != null)
             ret.add(urlStr);

 return ret;

GOVCN_CAROUSEL_LINKS_GETTER

```
public static java.util.Vector<java.lang.String> GOVCN_CAROUSEL_LINKS_GETTER
            (java.net.URL url,
             java.util.Vector<HTMLNode> page)
```
The News Site at address: "https://www.gov.cn/" has a Java-Script "Links Carousel". Essentially, there is a section with "Showcased News Articles" that is intended to emphasize anywhere between four and eight primary articles.

This Links-Carousel is wrapped in an HTML Divider Element as below: <DIV CLASS="slider-carousel">.

If this code were translated into an "XPath Query" or "CSS Selector", it would read: div[class=slider-carousel] a. Specifically it says to find all 'Anchor' elements that are descendants of '<DIV CLASS="slider-carousel">' Elements.
See Also:

InnerTagGetInclusive.first(Vector, String, String, TextComparitor, String[]), TagNodeGet.all(Vector, TC, String[]), TagNode.AV(String)

Code:
Exact Method Body:

 Vector<String> ret = new Vector<>();
 String urlStr;

 // Find the first <DIV CLASS="slider-carousel"> ... </DIV> section
 Vector<HTMLNode> carouselDIV = InnerTagGetInclusive.first
     (page, "div", "class", TextComparitor.CN_CI, "slider-carousel");

 // Retrieve any HTML Anchor <A HREF=...> ... </A> found within the contents of the
 // Divider.
 for (TagNode tn: TagNodeGet.all(carouselDIV, TC.OpeningTags, "a"))
     if ((urlStr = tn.AV("href")) != null)
         ret.add(urlStr);

 return ret;
