Package Torello.HTML.Tools.NewsSite
Class NewsSites
- java.lang.Object
-
- Torello.HTML.Tools.NewsSite.NewsSites
-
public class NewsSites extends java.lang.Object
This class is nothing more than an 'Example Class' that contains some foreign-language based news web-pages, from both overseas and from Latin America.
This class provides five example News Websites with all of the necessary configurations that would be passed to ScrapeURLs and, subsequently, to ScrapeArticles.
The following news-oriented web-sites are provided in this example class:
- https://abc.es
- https://elnacional.com
- https://elespectador.com
- https://www.gov.cn
- https://elpulso.mx
Side Note: Scraping major Associated Press news-sites such as Fox-News, CNN, MSNBC, and Yahoo! News is not a problem for this software - although taking both spiritual and moral stances against the terror that these organizations have caused the world is largely the driving force behind wanting to scrape foreign news sites.
Hi-Lited Source-Code:
- View Here: Torello/HTML/Tools/NewsSite/NewsSites.java
- Open New Browser-Tab: Torello/HTML/Tools/NewsSite/NewsSites.java
File Size: 12,829 Bytes Line Count: 340 '\n' Characters Found
-
-
Field Summary
Example of (Extremely-Simple) News Web-Sites: Instantiated Singleton Constants
Modifier and Type, Field:
- static NewsSite ABCES
- static NewsSite ElEspectador
- static NewsSite ElNacional
- static NewsSite GovCN
- static NewsSite GovCNCarousel
- static NewsSite Pulso
-
Method Summary
Functional-Interface Lambda-Target Methods (Functions for 'Function-Pointers')
Modifier and Type, Method:
- static Vector<String> ABC_LINKS_GETTER(URL url, Vector<HTMLNode> page)
- static Vector<String> EL_ESPECTADOR_LINKS_GETTER(URL url, Vector<HTMLNode> page)
- static Vector<String> EL_NACIONAL_LINKS_GETTER(URL url, Vector<HTMLNode> page)
- static Vector<String> GOVCN_CAROUSEL_LINKS_GETTER(URL url, Vector<HTMLNode> page)
Command Line Invocation Methods
Modifier and Type, Method:
- static void main(String[] argv)
- static void runExample()
-
-
-
Field Detail
-
ABCES
public static final NewsSite ABCES
This is the NewsSite definition for the Newspaper located at: https://www.abc.es/.
Parameter, Significance:
- Newspaper Name: ABC España
- Country of Origin: Spain
- Website URL: https://abc.es
- Newspaper Printing Language: Spanish
Parameter, Purpose, Value:
- Newspaper Article Groups / Sections. Purpose: Scrape Sections. Value: Retrieved from Data File
- StrFilter. Purpose: News Web-Site Section-Page Article-Link (<A HREF=...>) Filter. Value: 'HREF' must end with '.html'. See: StrFilter.comparitor(TextComparitor, String[]). See: TextComparitor.EW_CI
- LinksGet. Purpose: Used to manually retrieve Article-Link URL's. Value: Invokes method ABC_LINKS_GETTER(URL, Vector)
- ArticleGet. Purpose: Retrieves Article-Body Content from an Article-Link Web-Page. Value: <MAIN>...</MAIN>. See: ArticleGet.usual(String)
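The link filter described above keeps only those HREF values that end with '.html', ignoring case. The following is a minimal plain-Java sketch of that rule, not the library's own StrFilter implementation; the class name, the predicate, and the sample links are hypothetical and exist only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical stand-in for an "ends with '.html', case-insensitive" link filter.
class HtmlSuffixFilterSketch
{
    // Accepts an HREF only when it ends with ".html", ignoring case.
    static final Predicate<String> ENDS_WITH_HTML =
        href -> href != null && href.toLowerCase(Locale.ROOT).endsWith(".html");

    // Returns the subset of links that pass the filter, in their original order.
    static List<String> keepArticleLinks(List<String> hrefs)
    {
        return hrefs.stream().filter(ENDS_WITH_HTML).collect(Collectors.toList());
    }

    public static void main(String[] args)
    {
        // Made-up links; only the two ending in ".html" / ".HTML" are kept.
        List<String> hrefs = Arrays.asList(
            "https://www.abc.es/espana/some-story.html",
            "https://www.abc.es/deportes/",
            "https://www.abc.es/economia/OTHER-STORY.HTML"
        );

        System.out.println(keepArticleLinks(hrefs));
    }
}
```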
View a copy of the logs that are generated from using this NewsSite instance: ABC.ES ScrapeURLs LOG
ScrapeArticles
IMPORTANT NOTE: Though the ScrapeURLs code will check for duplicate URL's that may be returned within any given section, Article URL's may be repeated among the different sections of the newspaper. Since the URL-scrape returned nearly 3,000 articles, the log of an Article scrape is not included here. Proper duplicate URL checking code has been written, but would be too complicated to show in this example.
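The cross-section duplicate check mentioned in the note above is not shown in this class. As a rough illustration only, and not the author's code, one simple approach is to merge the per-section link lists through an insertion-ordered set. The class and method names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: drop article links that repeat across newspaper sections.
class CrossSectionDedupSketch
{
    // Merges all section lists, keeping only the first occurrence of each URL string.
    static List<String> mergeWithoutDuplicates(List<List<String>> linksPerSection)
    {
        // LinkedHashSet rejects repeats while preserving first-seen order.
        Set<String> seen = new LinkedHashSet<>();

        for (List<String> sectionLinks : linksPerSection) seen.addAll(sectionLinks);

        return new ArrayList<>(seen);
    }

    public static void main(String[] args)
    {
        // Made-up section results; "/story-b.html" appears in both sections.
        List<List<String>> sections = Arrays.asList(
            Arrays.asList("/story-a.html", "/story-b.html"),
            Arrays.asList("/story-b.html", "/story-c.html")
        );

        System.out.println(mergeWithoutDuplicates(sections));
    }
}
```

This compares URL's as plain strings, so two spellings of the same address (for example with and without a trailing fragment) would still count as different links.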
Change: There are no guarantees when scraping HTML from the Internet. If any of the news-providers in this example-class were to modify or update the HTML that serves their news-stories, there is a real chance that the "Getters" and "Filters" in these examples would no longer be valid. It is important to realize, though, that although the HTML wrappers for the Article Bodies or the Article Links might change on the source news-site, updating the Links and Article Getters (or the Links Filter) is at most a change of 5 lines of code.
If at some point, use of this class results in a long stream of messages indicating that no Article URL-Links were identified, or that the Article-Bodies failed to be extracted, simply look at the raw-HTML from the site and change the getters or Regular-Expressions accordingly.
Note: The logs included in this class' documentation were generated by scrapes in September of 2020.
-
Pulso
public static final NewsSite Pulso
This is the NewsSite definition for the Newspaper located at: https://www.elpulso.mx/.
Parameter, Significance:
- Newspaper Name: El Pulso, México
- Country of Origin: México
- Website URL: https://elpulso.mx
- Newspaper Printing Language: Spanish
Parameter, Purpose, Value:
- Newspaper Article Groups / Sections. Purpose: Scrape Sections. Value: Retrieved from Data File
- StrFilter. Purpose: News Web-Site Section-Page Article-Link (<A HREF=...>) Filter. Value: HREF must match: http://some.domain/YYYY/MM/DD/<article-name>/
- LinksGet. Purpose: Used to manually retrieve Article-Link URL's. Value: null. Retrieves all Anchor-Links on a Section-Page. Note that URL's must still pass the previous StrFilter (above) in order to be parsed as Article's.
- ArticleGet. Purpose: Retrieves Article-Body Content from an Article-Link Web-Page. Value: <DIV CLASS="entry-content">...</DIV>. See: ArticleGet.usual(TextComparitor, String[]). See: TextComparitor.C
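The HREF shape required above, http://some.domain/YYYY/MM/DD/<article-name>/, can be approximated with a regular expression. The pattern below is an assumption written for illustration; it is not the expression used inside this class, and the class name and sample links are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of a "date-path" article-link check.
class DatePathLinkSketch
{
    // Assumed pattern: scheme, host, /YYYY/MM/DD/, one path segment, trailing slash.
    static final Pattern DATE_PATH =
        Pattern.compile("^https?://[^/]+/\\d{4}/\\d{2}/\\d{2}/[^/]+/$");

    // True when the whole HREF has the date-path shape.
    static boolean looksLikeArticle(String href)
    {
        return href != null && DATE_PATH.matcher(href).matches();
    }

    public static void main(String[] args)
    {
        // Made-up links for demonstration purposes only.
        System.out.println(looksLikeArticle("https://www.elpulso.mx/2020/09/15/some-article/"));
        System.out.println(looksLikeArticle("https://www.elpulso.mx/category/nacional/"));
    }
}
```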
-
ElNacional
public static final NewsSite ElNacional
This is the NewsSite definition for the Newspaper located at: https://www.elnacional.com/.
Parameter, Significance:
- Newspaper Name: El Nacional
- Country of Origin: Venezuela
- Website URL: https://elnacional.com
- Newspaper Printing Language: Spanish
Parameter, Purpose, Value:
- Newspaper Article Groups / Sections. Purpose: Scrape Sections. Value: Retrieved from Data File
- URLFilter. Purpose: News Web-Site Section-Page Article-Link (<A HREF=...>) Filter. Value: null. The LinksGet provided here will only return valid Article URL's, so there is no need for a URLFilter.
- LinksGet. Purpose: Used to manually retrieve Article-Link URL's. Value: Invokes method EL_NACIONAL_LINKS_GETTER(URL, Vector)
- ArticleGet. Purpose: Retrieves Article-Body Content from an Article-Link Web-Page. Value: <ARTICLE>...</ARTICLE>. See: ArticleGet.usual(String)
View a copy of the logs that are generated from using this NewsSite.
-
ElEspectador
public static final NewsSite ElEspectador
This is the NewsSite definition for the Newspaper located at: https://www.elespectador.com/.
Parameter, Significance:
- Newspaper Name: El Espectador
- Country of Origin: Colombia
- Website URL: https://elespectador.com
- Newspaper Printing Language: Spanish
Parameter, Purpose, Value:
- Newspaper Article Groups / Sections. Purpose: Scrape Sections. Value: Retrieved from Data File
- StrFilter. Purpose: News Web-Site Section-Page Article-Link (<A HREF=...>) Filter. Value: HREF must end with a forward-slash '/' character. See: TextComparitor.ENDS_WITH
- LinksGet. Purpose: Used to manually retrieve Article-Link URL's. Value: Invokes method EL_ESPECTADOR_LINKS_GETTER
- ArticleGet. Purpose: Retrieves Article-Body Content from an Article-Link Web-Page. Value: <DIV CLASS="l-main">...</DIV>. See: ArticleGet.usual(TextComparitor, String[]). See: TextComparitor.C
View a copy of the logs that are generated from using this NewsSite.
-
GovCNCarousel
public static final NewsSite GovCNCarousel
This is the NewsSite definition for the Newspaper located at: https://www.gov.cn/.
The "Carousels" are just the emphasized or "HiLighted" links that are on three separate pages. There is also a complete-link NewsSite definition that will retrieve all links, not just the links highlighted by the carousel.
Parameter, Significance:
- Newspaper Name: Chinese Government Web Portal
- Country of Origin: People's Republic of China
- Website URL: https://gov.cn
- Newspaper Printing Language: Mandarin Chinese
Parameter, Purpose, Value:
- Newspaper Article Groups / Sections. Purpose: Scrape Sections. Value: Retrieved from Data File
- StrFilter. Purpose: News Web-Site Section-Page Article-Link (<A HREF=...>) Filter. Value: HREF must match: "^http://www.gov.cn/(?:.+?/)?\\d{4}-\\d{2}/\\d{2}/(?:.+?/)?content_\\d+.htm(?:l)?(#\\d+)?"
- LinksGet. Purpose: Used to manually retrieve Article-Link URL's. Value: Invokes method GOVCN_CAROUSEL_LINKS_GETTER(URL, Vector)
- ArticleGet. Purpose: Retrieves Article-Body Content from an Article-Link Web-Page. Value: <DIV CLASS="article ...">...</DIV>. See: ArticleGet.usual(TextComparitor, String[]). See: TextComparitor.C
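To see what the HREF filter for this site accepts, the sketch below runs a regular expression against two sample links using only java.util.regex. It assumes the intended pattern is the one printed above with the year quantifier closed before the hyphen, that is \d{4}-\d{2}/\d{2}. It also assumes a "find"-style match, since the pattern has a leading '^' but no trailing '$'. The class name and both sample URL's are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: exercising the Gov.CN article-link pattern on sample links.
class GovCnLinkPatternSketch
{
    // Assumed reading of the documented pattern (year quantifier closed as \d{4}).
    static final Pattern ARTICLE_LINK = Pattern.compile(
        "^http://www.gov.cn/(?:.+?/)?\\d{4}-\\d{2}/\\d{2}/(?:.+?/)?content_\\d+.htm(?:l)?(#\\d+)?"
    );

    // Assumption: a link passes when the pattern is found at the start of the HREF.
    static boolean passes(String href)
    {
        return href != null && ARTICLE_LINK.matcher(href).find();
    }

    public static void main(String[] args)
    {
        // Made-up links: a dated "content_" page, and a section index page.
        System.out.println(passes("http://www.gov.cn/xinwen/2020-09/15/content_5543525.htm"));
        System.out.println(passes("http://www.gov.cn/xinwen/index.htm"));
    }
}
```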
View a copy of the logs that are generated from using this NewsSite.
-
GovCN
public static final NewsSite GovCN
This is the NewsSite definition for the Newspaper located at: https://www.gov.cn/.
This version of the "Gov.CN" website will scour a larger set of section URL's, and will not limit the returned Article-Links to just those found on the Java-Script Carousel. The Java-Script Carousel will almost always have a total of five news-article links available. This definition of 'NewsSite' may return up to thirty to forty different articles per news-section.
Parameter, Significance:
- Newspaper Name: Chinese Government Web Portal
- Country of Origin: People's Republic of China
- Website URL: https://gov.cn
- Newspaper Printing Language: Mandarin Chinese
Parameter, Purpose, Value:
- Newspaper Article Groups / Sections. Purpose: Scrape Sections. Value: Retrieved from Data File
- StrFilter. Purpose: News Web-Site Section-Page Article-Link (<A HREF=...>) Filter. Value: HREF must match: "^http://www.gov.cn/(?:.+?/)?\\d{4}-\\d{2}/\\d{2}/(?:.+?/)?content_\\d+.htm(?:l)?(#\\d+)?"
- LinksGet. Purpose: Used to manually retrieve Article-Link URL's. Value: null. Retrieves all Anchor-Links on a Section-Page. Note that URL's must still pass the previous StrFilter (above) in order to be parsed as Article's.
- ArticleGet. Purpose: Retrieves Article-Body Content from an Article-Link Web-Page. Value: <DIV CLASS="article ...">...</DIV>. See: ArticleGet.usual(TextComparitor, String[]). See: TextComparitor.C
View a copy of the logs that are generated from using this NewsSite.
-
-
Method Detail
-
runExample
public static void runExample() throws java.io.IOException
This example will run the news-site scrape on the Chinese Government News Article Carousel.
Important Note: This method will create a directory called "cnb" on your file-system, where it will write the contents of (most likely) 15 newspaper articles to disk as HTML files.
The output log generated by this method may be viewed here: Gov.CN.log.html
- Throws:
java.io.IOException - This throws for IO errors that may occur when reading from the web-server, or when saving the web-pages or images to the file-system.
- See Also:
FileRW.delTree(String, boolean, Appendable), NewsSite, FileRW.writeFile(CharSequence, String), C.toHTML(String, boolean, boolean, boolean)
- Code:
- Exact Method Body:
    // Click on the @LinkJavaSource Curved-Arrow to view the example code in full screen
    RunExample.run();
-
main
public static void main(java.lang.String[] argv) throws java.io.IOException
Prints the contents of the Data File. Invoking this command allows a programmer to see which "sub-sections" are ascribed to each of the different newspaper definitions in this class. Each "sub-section" is nothing more than a URL-branch of the primary web site URL.
HTML Elements:
    <!-- If the following were the primary news-site -->
    http://news.baidu.com

    <!-- This would be a "sub-section" of the primary site -->
    http://news.baidu.com/sports
Can be called from the command line.
If a single command-line argument is passed to "argv[0]", the contents of the "Sections URL Data File" will be output to a text-file that is named using the String passed to "argv[0]".
- Parameters:
argv - These are the command line arguments passed by the JRE to this method.
- Throws:
java.io.IOException - If there are any problems while attempting to save the output to the output file (if one was named / requested).
- Code:
- Exact Method Body:
    // Uncomment this line to run the example code (instead of section-data print)
    // runExample(); System.exit(0);

    // The data-file is loaded into private field "newsPaperSections"
    // This private field is a Hashtable<String, Vector<URL>>.  Convert each of
    // these sections so that they may be printed to terminal and maybe to a text
    // file.
    StringBuilder sb = new StringBuilder();

    for (String newspaper : newsPaperSections.keySet())
    {
        sb.append(newspaper + '\n');

        for (URL section : newsPaperSections.get(newspaper))
            sb.append(section.toString() + '\n');

        sb.append("\n\n***************************************************\n\n");
    }

    String s = sb.toString();
    System.out.println(s);

    // If there is a command-line parameter, it shall be interpreted as a file-name.
    // The contents of the "sections data-file" (as text) will be written to a file on the
    // file-system using the String-value of "argv[0]" as the name of the output file.
    if (argv.length == 1) FileRW.writeFile(s, argv[0]);
-
ABC_LINKS_GETTER
public static java.util.Vector<java.lang.String> ABC_LINKS_GETTER (java.net.URL url, java.util.Vector<HTMLNode> page)
The News Site at address: "https://www.abc.es/" is slightly more complicated when retrieving News-Article Links.
Notice that each newspaper article URL-link is "wrapped" in an HTML '<ARTICLE>...</ARTICLE>' Element.
If this code were translated into an "XPath Query" or "CSS Selector", it would read: article a. Specifically, it says to find all 'Anchor' elements that are descendants of 'Article' Elements. Note that the method body keeps only the first such anchor found inside each 'Article' Element.
- See Also:
TagNodeFindL1Inclusive.all(Vector, String), TagNodeGet.first(Vector, int, int, TC, String[]), TagNode.AV(String)
- Code:
- Exact Method Body:
    final Vector<String> ret = new Vector<>();
    TagNode tn;
    String urlStr;

    // Links are kept inside <ARTICLE> ... </ARTICLE> on the main / section page.
    for (DotPair article : TagNodeFindL1Inclusive.all(page, "article"))

        // Now find the <A HREF=...> ... </A>
        if ((tn = TagNodeGet.first(page, article.start, article.end, TC.OpeningTags, "a")) != null)
            if ((urlStr = tn.AV("href")) != null)
                ret.add(urlStr);

    return ret;
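For readers more familiar with XPath than with this library's search classes, the same idea can be sketched with the JDK's built-in javax.xml.xpath package. This is only an analogy under stated assumptions: it requires well-formed XML input (real news pages usually are not), the class name and sample markup are hypothetical, and it is not how ABC_LINKS_GETTER is implemented. Like the method body above, it keeps the first anchor inside each article element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: an XPath analogue of "first <a href> inside each <article>".
class ArticleAnchorXPathSketch
{
    // Returns the HREF of the first anchor within each <article> of a well-formed document.
    static List<String> firstAnchorPerArticle(String xhtml)
    {
        try
        {
            XPath xpath = XPathFactory.newInstance().newXPath();

            // Select every <article> element in the document.
            NodeList articles = (NodeList) xpath.evaluate(
                "//article", new InputSource(new StringReader(xhtml)), XPathConstants.NODESET
            );

            List<String> ret = new ArrayList<>();

            for (int i = 0; i < articles.getLength(); i++)
            {
                // First descendant anchor of this article; empty string when none has an href.
                String href = xpath.evaluate("(.//a)[1]/@href", articles.item(i));
                if (!href.isEmpty()) ret.add(href);
            }

            return ret;
        }
        catch (XPathExpressionException e)
        {
            throw new IllegalArgumentException("Input could not be evaluated as XML", e);
        }
    }

    public static void main(String[] args)
    {
        // Made-up, well-formed markup for demonstration purposes only.
        String page =
            "<div>" +
            "<article><a href='/espana/story-one.html'>One</a><a href='/tags/x'>tag</a></article>" +
            "<article><p>no link here</p></article>" +
            "<a href='/outside.html'>outside any article</a>" +
            "</div>";

        System.out.println(firstAnchorPerArticle(page));
    }
}
```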
-
EL_NACIONAL_LINKS_GETTER
public static java.util.Vector<java.lang.String> EL_NACIONAL_LINKS_GETTER (java.net.URL url, java.util.Vector<HTMLNode> page)
The News Site at address: "https://www.ElNacional.com/" is slightly more complicated when retrieving News-Article Links.
Notice that each newspaper article URL-link is "wrapped" in an HTML '<DIV CLASS="td-module-thumb">...</DIV>' Element.
If this code were translated into an "XPath Query" or "CSS Selector", it would read: div.td-module-thumb a. Specifically, it says to find all 'Anchor' elements that are descendants of 'DIV' Elements where said Divider's CSS CLASS contains 'td-module-thumb'.
- See Also:
InnerTagFindInclusive.all(Vector, String, String, TextComparitor, String[]), TagNodeGet.first(Vector, int, int, TC, String[]), TagNode.AV(String)
- Code:
- Exact Method Body:
    Vector<String> ret = new Vector<>();
    TagNode tn;
    String urlStr;

    // Links are kept inside <DIV CLASS=td-module-thumb> ... </DIV> on the main / section page.
    for (DotPair article : InnerTagFindInclusive.all
        (page, "div", "class", TextComparitor.C, "td-module-thumb"))

        // Now find the <A HREF=...> ... </A>
        if ((tn = TagNodeGet.first
            (page, article.start, article.end, TC.OpeningTags, "a")) != null)

            if ((urlStr = tn.AV("href")) != null)
                ret.add(urlStr);

    return ret;
-
EL_ESPECTADOR_LINKS_GETTER
public static java.util.Vector<java.lang.String> EL_ESPECTADOR_LINKS_GETTER (java.net.URL url, java.util.Vector<HTMLNode> page)
The News Site at address: "https://www.ElEspectador.com/" is slightly more complicated when retrieving News-Article Links.
Notice that each newspaper article URL-link is "wrapped" in an HTML '<DIV CLASS="Card ...">...</DIV>' Element.
If this code were translated into an "XPath Query" or "CSS Selector", it would read: div.Card a.card-link. Specifically, it says to find all 'Anchor' elements whose CSS CLASS contains 'card-link' and which are descendants of 'DIV' Elements where said Divider's CSS CLASS contains 'Card'.
- See Also:
InnerTagFindInclusive.all(Vector, String, String, TextComparitor, String[]), InnerTagGet.first(Vector, int, int, String, String, TextComparitor, String[]), TagNode.AV(String)
- Code:
- Exact Method Body:
    Vector<String> ret = new Vector<>();
    TagNode tn;
    String urlStr;

    // Links are kept inside <DIV CLASS="Card ..."> ... </DIV> on the main / section page.
    for (DotPair article : InnerTagFindInclusive.all
        (page, "div", "class", TextComparitor.C, "Card"))

        // Now find the <A CLASS="card-link" HREF=...> ... </A>
        if ((tn = InnerTagGet.first
            (page, article.start, article.end, "a", "class", TextComparitor.C, "card-link")) != null)

            if ((urlStr = tn.AV("href")) != null)
                ret.add(urlStr);

    return ret;
-
GOVCN_CAROUSEL_LINKS_GETTER
public static java.util.Vector<java.lang.String> GOVCN_CAROUSEL_LINKS_GETTER (java.net.URL url, java.util.Vector<HTMLNode> page)
The News Site at address: "https://www.gov.cn/" has a Java-Script "Links Carousel". Essentially, there is a section with "Showcased News Articles" that is intended to emphasize anywhere between four and eight primary articles.
This Links-Carousel is wrapped in an HTML Divider Element as below: <DIV CLASS="slider-carousel">.
If this code were translated into an "XPath Query" or "CSS Selector", it would read: div[class=slider-carousel] a. Specifically, it says to find all 'Anchor' elements that are descendants of '<DIV CLASS="slider-carousel">' Elements.
- See Also:
InnerTagGetInclusive.first(Vector, String, String, TextComparitor, String[]), TagNodeGet.all(Vector, TC, String[]), TagNode.AV(String)
- Code:
- Exact Method Body:
    Vector<String> ret = new Vector<>();
    String urlStr;

    // Find the first <DIV CLASS="slider-carousel"> ... </DIV> section
    Vector<HTMLNode> carouselDIV = InnerTagGetInclusive.first
        (page, "div", "class", TextComparitor.CN_CI, "slider-carousel");

    // Retrieve any HTML Anchor <A HREF=...> ... </A> found within the contents of the
    // Divider.
    for (TagNode tn : TagNodeGet.all(carouselDIV, TC.OpeningTags, "a"))
        if ((urlStr = tn.AV("href")) != null)
            ret.add(urlStr);

    return ret;
-
-