Package Torello.HTML.Tools.NewsSite
Class ScrapeURLs
- java.lang.Object
  - Torello.HTML.Tools.NewsSite.ScrapeURLs
public class ScrapeURLs extends java.lang.Object
Collects all news-article URL's from a news-oriented web-site's main web-page and from its list of 'sub-section' web-pages.

News-Site Scrape: User-Main A.P.I. Class

This class will scour a News or Information Web-Site for all relevant Article URL's, and save those URL's to a Vector. Once completed, the complete list of Article URL's may be returned to the user for subsequent downloading of each Article's HTML content.

Once the URL's have been collected, class ScrapeArticles may be used to retrieve the contents of each of the pages for those URL's.
This HTML Search, Parse and Scrape package was initially written to help download and translate news articles from web-sites overseas.
The purpose of this class is to scrape the relevant news-paper articles from an Internet News Web-Site. These article URL's are returned inside of a "Vector of Vector's". Most news-based web-sites on the Internet have, since their founding, divided their news-articles into separate "sub-sections." Such sections often include "News", "World News", "Life", "Sports", "Finance" etc...

Generally, searching through only the "top-level" news-site web-page is not enough to retrieve all articles available on the site for any given day of the week. The primary purpose of this class is to visit each page on a user-provided "Section List", and identify each and every Article URL available on each of these sub-sections (and return those lists to the programmer).

The "Vector of Vector's" that is returned by this class' "get" methods will contain all identified News Article URL's in each sub-section of the news web-site, assuming the appropriate "Getters" have been provided. This list of sub-sections is expected to be provided to the "get" method, when invoking it, by passing a list of sections to the parameter "sectionURLs".
In addition to a list of sub-sections, the user should also specify an instance of URLFilter. This filter informs the scraper which URL's to ignore, and which to keep. In most of the news-sites that have been tested with this package, the actual (non-advertising) article URL's follow a very specific pattern that a plain-old regular-expression can identify easily, as in the sketch below.
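As a minimal sketch (the regular-expression and host-name below are placeholders, not taken from any real site; the complete, working www.gov.cn example appears further down this page), such a filter could be built with the same factory methods used in that example:

    // Placeholder pattern: keep only URL's that look like numbered article pages.
    Pattern articleRegEx =
        Pattern.compile("https?://some\\.news\\.site/articles/\\d+\\.html");

    // Build the URLFilter from the Regular-Expression (KEEP what matches, filter the rest).
    URLFilter filter = URLFilter.fromStrFilter(StrFilter.regExKEEP(articleRegEx, true));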
This package has a small Lambda-Target (Function-Pointer) class called LinksGet that lets you use any number of very common and very simple mechanisms for identifying (as in a 'PASS' / 'FAIL') which URL's are, indeed, URL's for an actual News-Article. This allows the programmer to skip over swaths of advertisement, photo-journal, and other irrelevant link-pages.
Perhaps the user may wonder what work this class is actually doing if it is necessary to provide instances of URLFilter and a Vector of 'sectionURLs' - and the answer is: not a lot! This class is actually very short; it just ensures that as much error checking as possible is done, and that the returned Vector has been checked for validity.
REMEMBER:
Building an instance of LinksGet should require nothing more than perusing the HTML of your site's section-pages, and checking what features the actual article URL's have in common.
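Below is a minimal sketch, under these assumptions: LinksGet is the two-argument lambda-target that receives the Section-URL and the already-parsed page-Vector (the same two arguments handed to it inside the get(...) method body further down this page) and returns the candidate HREF-Strings as a Vector<String>; the "/articles/" marker is a hypothetical placeholder for whatever the target site's article links actually have in common:

    LinksGet linksGetter = (URL sectionURL, Vector<HTMLNode> sectionPage) ->
    {
        Vector<String> hrefs = new Vector<>();

        // InnerTagGet.all(...) and TagNode.AV("href") are the same calls the scraper
        // itself falls back on when no LinksGet instance is provided.
        for (TagNode tn : InnerTagGet.all(sectionPage, "a", "href"))
        {
            String href = tn.AV("href");
            if ((href != null) && href.contains("/articles/")) hrefs.add(href);
        }

        return hrefs;
    };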
Here is an example "URL Retrieve" operation on the Mandarin Chinese Language Government Web-Portal available in North America. Translating these pages in order to study the politics and technology from the other side of the Pacific Ocean was the primary impetus for developing the Java-HTML JAR Library.
Example:
// Sample Article URL from the Chinese National Web-Portal - all valid articles have the
// basic pattern
// http://www.gov.cn/xinwen/2020-07/17/content_5527889.htm
//
// This "Regular Expression" will match any News Article URL that "looks like" the above URL.
// This Regular-Expression can be passed to class URLFilter.

String articleURLRegExStr =
    "http://www.gov.cn/xinwen/\\d\\d\\d\\d-\\d\\d/\\d\\d/content_\\d+.html?";

Pattern articleURLsRegEx = Pattern.compile(articleURLRegExStr);

URLFilter filter = URLFilter.fromStrFilter(StrFilter.regExKEEP(articleURLsRegEx, true));

// This will hold the list of "Main Page Sub-Sections".  In this example, we will only look at
// Articles on the "First Page", and the rest of the News-Paper's Sub-Sections will be ignored.
//
// For the purposes of this example, only one section of the 'www.Gov.CN/' web-portal will be
// visited.  There are other "Newspaper Sub-Sections" that could easily be added to this Vector.
// If more sections were added, more news-article URL's would likely be found, identified and
// returned.

Vector<URL> sectionURLs = new Vector<>(1);
sectionURLs.add(new URL("https://www.gov.cn/"));

// Run the Article URL scraper.  In this example, the 'filter' (a URLFilter) is enough for
// getting the Article URL's.  'null' is passed to the LinksGet parameter.

Vector<Vector<String>> articleURLs = ScrapeURLs.get(sectionURLs, filter, null, System.out);

// This will write every article URL to a text file called "urls.txt".
//
// NOTE: Since only one Sub-Section was added in this example, there is no need to write out
//       the entire "Vector of Vectors", but rather just the first (and only) element's contents.

FileRW.writeFile(articleURLs.elementAt(0), "urls.txt");

// This will write the article-URL's Vector to a serialized-object data-file called "urls.vdat"

FileRW.writeObjectToFile(articleURLs, "urls.vdat", true);

// AT THIS POINT, YOU SHOULD BE READY TO RUN THE ARTICLE-SCRAPER CLASS
// ...
NOTE:
The 'urls.vdat' file that was created can easily be retrieved using Java's de-serialization streams. Because the cast (below) is an unchecked cast, an annotation of the form @SuppressWarnings("unchecked") is needed on the enclosing method to suppress the resulting compiler warning.
Using Java's Serialization and De-Serialization Mechanism for saving temporary results to disk is extremely easy in Java-HTML.
Java Line of Code:
Vector<Vector<String>> articleURLs = (Vector<Vector<String>>) FileRW.readObjectFromFile("urls.vdat", Vector.class, true);
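As a sketch only (the helper-method name is hypothetical, and the throws-clause is declared broadly because the exact checked exceptions of FileRW.readObjectFromFile are not listed on this page), the annotation mentioned in the NOTE above would be placed on the enclosing method like so:

    // Hypothetical helper: the unchecked cast lives inside this method, so the
    // @SuppressWarnings("unchecked") annotation is attached here.
    @SuppressWarnings("unchecked")
    static Vector<Vector<String>> loadArticleURLs() throws Exception
    {
        return (Vector<Vector<String>>)
            FileRW.readObjectFromFile("urls.vdat", Vector.class, true);
    }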
Hi-Lited Source-Code:
- View Here: Torello/HTML/Tools/NewsSite/ScrapeURLs.java
Stateless Class: This class neither contains any program-state, nor can it be instantiated. The @StaticFunctional Annotation may also be called 'The Spaghetti Report'. Static-Functional classes are, essentially, C-Styled Files, without any constructors or non-static member fields. It is a concept very similar to the Java Bean's @Stateless Annotation.
- 1 Constructor(s), 1 declared private, zero-argument constructor
- 3 Method(s), 3 declared static
- 1 Field(s), 1 declared static, 0 declared final
- Fields excused from final modifier (with explanation):
Field 'SKIP_ON_SECTION_URL_EXCEPTION' is not final. Reason: CONFIGURATION
Field Summary

Modifier and Type    Field
static boolean       SKIP_ON_SECTION_URL_EXCEPTION
Field Detail
-
SKIP_ON_SECTION_URL_EXCEPTION
public static boolean SKIP_ON_SECTION_URL_EXCEPTION
This is a static boolean configuration field. When this is set to TRUE, if one of the "Section URL's" provided to this class is not valid and generates a 404 FileNotFoundException, or some other HttpConnection exception, those exceptions will simply be logged and quietly ignored.

When this flag is set to FALSE, any problem that occurs when attempting to pick out News Article URL's from a Section Web-Page will cause a SectionURLException to be thrown, and the ScrapeURLs process will halt.

SIMPLY PUT: There are occasions when a news web-site will remove a section such as "Commerce", "Sports", or "Travel" - and when one of these suddenly goes missing, it is better to just skip that section rather than halt the scrape; for that behavior, keep this flag set to TRUE.

ALSO: This is, indeed, a public and static flag (field), which means that all processes (Thread's) using class ScrapeURLs must share the same setting (simultaneously). This particular flag CANNOT be changed in a Thread-Safe manner.
- Code:
- Exact Field Declaration Expression:
public static boolean SKIP_ON_SECTION_URL_EXCEPTION = true;
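For illustration only (a sketch reusing the 'sectionURLs' and 'filter' variables from the example near the top of this page, and assuming SectionURLException is unchecked, as suggested by the four-argument get(...) signature below, which declares no throws-clause), a caller that prefers a hard failure when a Section-Page disappears could disable the flag before invoking the scraper:

    // Force a hard failure on bad Section-URL's instead of the default "log and skip".
    // Remember the caveat above: this is a shared, non-Thread-Safe, static setting.
    ScrapeURLs.SKIP_ON_SECTION_URL_EXCEPTION = false;

    Vector<Vector<String>> articleURLs = null;

    try
        { articleURLs = ScrapeURLs.get(sectionURLs, filter, null, System.out); }
    catch (SectionURLException e)
        { System.out.println("A Section-URL failed to load: " + e.getMessage()); }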
-
-
Method Detail
-
get
public static java.util.Vector<java.util.Vector<java.lang.String>> get (NewsSite ns, java.lang.Appendable log) throws java.io.IOException
Convenience overload. This method extracts the Section-URL list, the URLFilter and the LinksGet instance from the provided NewsSite, and delegates to the four-argument get(...) method below.
- Code:
- Exact Method Body:
return get(ns.sectionURLsVec(), ns.filter, ns.linksGetter, log);
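A sketch of a call-site for this overload ('mySite' is a hypothetical, already-constructed NewsSite instance; see class NewsSite for how to obtain one):

    // NOTE: this overload declares 'throws java.io.IOException', so the enclosing
    // method must either catch that exception or declare it.
    Vector<Vector<String>> articleURLs = ScrapeURLs.get(mySite, System.out);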
-
get
public static java.util.Vector<java.util.Vector<java.lang.String>> get (java.util.Vector<java.net.URL> sectionURLs, URLFilter articleURLFilter, LinksGet linksGetter, java.lang.Appendable log)
This method is used to retrieve all of the available article URL links found on all sections of a newspaper website.
- Parameters:
sectionURLs - This should be a Vector of URL's that holds all of the "Main News-Paper Page Sections." Typical News-Paper Sections are things like: Life, Sports, Business, World, Economy, Arts, etc... This parameter may not be null, or a NullPointerException will throw.
articleURLFilter - If there is a standard pattern for a URL that must be avoided, then this filter parameter should be used. This parameter may be null, and if it is, it shall be ignored. This Java URL-Predicate (an instance of Predicate<URL>) should return TRUE if a particular URL needs to be kept, not filtered. When this Predicate evaluates to FALSE, the URL will be filtered.
NOTE: This behavior is identical to the Java Stream's method "filter(Predicate<>)".
ALSO: URL's that are filtered will neither be scraped, nor saved, into the newspaper-article result-set output file.
linksGetter - This instance may be used to retrieve all links on a particular section-page. This parameter may be null. If it is null, it will be ignored - and all HTML Anchor (<A HREF=...>) links will be considered "Newspaper Articles to be scraped." Be careful about ignoring this parameter, because there may be many extraneous non-news-article links on a particular Internet News Web-Site or inside a Web-Page Section.
log - This prints log information to the screen. This parameter may not be null, or a NullPointerException will throw. This parameter expects an implementation of Java's interface java.lang.Appendable, which allows for a wide range of options when logging intermediate messages.

Class or Interface Instance                 Use & Purpose
'System.out'                                Sends text to the standard-out terminal.
Torello.Java.StorageWriter                  Sends text to System.out, and saves it, internally.
FileWriter, PrintWriter, StringWriter       General-purpose Java text-output classes.
FileOutputStream, PrintStream               More general-purpose Java text-output classes.

IMPORTANT: The interface Appendable requires that the checked exception IOException must be caught when using its append(CharSequence) methods.
- Returns:
- The "Vector of Vector's" that is returned is simply a list of all newspaper anchor-link URL's found on each Newspaper Sub-Section URL passed to the 'sectionURLs' parameter. The returned "Vector of Vector's" is parallel to the input-parameter Vector<URL> 'sectionURLs'.
What this means is that the Newspaper-Article URL-Links scraped from the page located at sectionURLs.elementAt(0) will be stored in the return-Vector at ret.elementAt(0). The article URL's scraped off of the page URL from sectionURLs.elementAt(1) will be stored in the return-Vector at ret.elementAt(1). And so on, and so forth...
- Throws:
SectionURLException - If one of the provided Section URL's (Life, Sports, Travel, etc...) is not valid, or not available, then this exception will throw. Note, however, that there is a flag (SKIP_ON_SECTION_URL_EXCEPTION) that will force this method to simply "skip" a faulty or non-available Section URL, and move on to the next news-article section.
By default, this flag is set to TRUE, meaning that this method will skip news-paper sections that have been temporarily removed rather than causing the method to exit. This default behavior can be changed by setting the flag FALSE.
- Code:
- Exact Method Body:
LOG_WRITE(
    log,
    "\n" + BRED +
    "*****************************************************************************************\n" +
    "*****************************************************************************************\n" +
    RESET + " Finding Article URL's in Newspaper Sections" + BRED + "\n" +
    "*****************************************************************************************\n" +
    "*****************************************************************************************\n" +
    RESET + '\n'
);

Vector<Vector<String>> ret = new Vector<>();

for (URL sectionURL : sectionURLs)
{
    Stream<String> urlStream;

    // It helps to run this, because web-pages can use a lot of strings
    System.gc();

    // Starting Scraping the Section for URL's
    LOG_WRITE(log, "Visiting Section URL: " + sectionURL.toString() + '\n');

    try
    {
        // Download, Scrape & Parse the main-page or section URL.
        Vector<HTMLNode> sectionPage = HTMLPage.getPageTokens(sectionURL, false);

        // If the 'LinksGet' instance is null, then select all URL's on the main-page /
        // section-page, and pray for rain (hope for the best).  If no 'LinksGet' instance
        // was provided, there will likely be many spurious / irrelevant links to
        // non-article pages, and even advertisement pages that are also included in this
        // Stream<String>.
        //
        // InnerTagGet returns a Vector<TagNode>.  Convert that to a Stream<String>, where
        // each 'String' in the 'Stream' is the HREF attribute of the <A HREF=...> tag.

        if (linksGetter == null)
            urlStream = InnerTagGet.all(sectionPage, "a", "href")
                .stream().map((TagNode tn) -> tn.AV("href"));
        else
            urlStream = linksGetter.apply(sectionURL, sectionPage).stream();
    }
    catch (Exception e)
    {
        LOG_WRITE(
            log,
            BRED + "Error loading this main-section page-URL\n" + RESET +
            e.getMessage() + '\n'
        );

        if (SKIP_ON_SECTION_URL_EXCEPTION)
        {
            LOG_WRITE(log, "Non-fatal Exception, continuing to next Section URL.\n\n");
            continue;
        }
        else
        {
            LOG_WRITE(
                log,
                BRED + "Fatal - Exiting.  Top-Level Section URL's must be valid URL's." +
                RESET + "\n" + HTTPCodes.convertMessageVerbose(e, sectionURL, 0) + '\n'
            );

            throw new SectionURLException
                ("Invalid Main Section URL: " + sectionURL.toString(), e);
        }
    }

    Vector<String> sectionArticleURLs = urlStream

        // If any TagNode's did not have HREF-Attributes, remove those null-values
        .filter ((String href) -> (href != null))

        // Perform a Standard String.trim() operation.
        .map ((String href) -> href.trim())

        // Any HREF's that are "just white-space" are now removed.
        .filter ((String href) -> href.length() > 0)

        // This removes any HREF Attribute values that begin with
        // "mailto:" "tel:" "javascript:" "magnet:" etc...
        .filter ((String href) -> StrCmpr.startsWithNAND_CI(href, Links.NON_URL_HREFS()))

        // Now, Resolve any "Partial URL References"
        .map ((String href) -> Links.resolve_KE(href, sectionURL))

        // If there were any exceptions while performing the Partial-URL Resolve-Operation,
        // then print an error message.
        .peek ((Ret2<URL, MalformedURLException> r2) ->
        {
            if (r2.b != null) LOG_WRITE(
                log,
                "Section URL was a malformed URL, and provided exception message:\n" +
                r2.b.getMessage() + '\n'
            );
        })

        // Convert the Ret2<URL, Exception> to just the URL, without any Exceptions
        .map ((Ret2<URL, MalformedURLException> r2) -> r2.a)

        // If there was an exception, the URL Ret.a field would be null (remove nulls)
        .filter ((URL url) -> url != null)

        // NOTE: When this evaluates to TRUE - it should be kept.
        // Java Stream's 'filter' method states that when the predicate evaluates to TRUE,
        // the stream element is KEPT / RETAINED.
        //
        // Class URLFilter mimics the filter behavior of Streams.filter(...)
        .filter ((URL url) -> (articleURLFilter == null) || articleURLFilter.test(url))

        // Convert these to "Standard Strings"
        // Case-Insensitive parts are set to LowerCase
        // Case Sensitive Parts are left alone.
        .map ((URL url) -> URLs.urlToString(url))

        // Filter any duplicates -> This is the reason for the above case-sensitive parts
        // being separated.
        .distinct()

        // Convert the URL's back to a String.  There really should not be any exceptions.
        // This is just an "extra-careful" step.  It is not needed.
        .filter ((String url) ->
            { try { new URL(url); return true; } catch (Exception e) { return false; } })

        // Convert the Stream to a Vector
        .collect(Collectors.toCollection(Vector::new));

    ret.add(sectionArticleURLs);

    LOG_WRITE(
        log,
        "Found [" + BYELLOW + sectionArticleURLs.size() + RESET + "] " +
        "Article Links.\n\n"
    );
}

// Provide a simple count to the log output on how many URL's have been uncovered.
// NOTE: This does not heed whether different sections contain non-distinct URL's.
//       (An identical URL found in two different sections will be counted twice!)

int totalURLs = 0;

// <?> Prevents the "Xlint:all" from generating warnings...
for (Vector<?> section : ret) totalURLs += section.size();

LOG_WRITE(
    log,
    "Complete Possible Article URL list has: " +
    BYELLOW + StringParse.zeroPad10e4(totalURLs) + RESET + ' ' +
    "url(s).\n\n"
);

return ret;
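As a usage note (a sketch, again reusing the 'sectionURLs' and 'filter' variables from the example near the top of this page), the parallel-Vector contract described under 'Returns' can be consumed like this:

    Vector<Vector<String>> articleURLs = ScrapeURLs.get(sectionURLs, filter, null, System.out);

    // The return-Vector is parallel to 'sectionURLs': element 'i' of the result holds the
    // article links scraped from the section-page at sectionURLs.elementAt(i).
    for (int i = 0; i < sectionURLs.size(); i++)
        System.out.println(
            sectionURLs.elementAt(i) + " yielded " +
            articleURLs.elementAt(i).size() + " article URL(s)"
        );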