public class ScrapeURLs extends java.lang.Object

Collects all news-article URL's from a news-oriented web-site's main web-page and from the list of 'sub-section' web-pages.
The primary purpose of this class is to scrape the relevant newspaper-article URL's from an Internet news web-site. These article URL's are returned inside of a "Vector of Vectors." Most news-based web-sites on the Internet have, since their founding, divided their news-articles into separate "sub-sections." This HTML Search, Parse and Scrape package was written to help download and translate news-articles from overseas web-sites. Generally, visiting the top-level news-site web-page is not enough to retrieve all relevant news-articles that are available for any given day of the week. This class therefore visits each of the "News Sections" available on the site, scrapes the article URL's from each, and returns them.
The "Vector of Vectors" that is returned by the primary
get(...)method is designed to return a list of all news-URL's that are available for each of the separate "news sections" that are identified on the primary web-page. The list of news-sections are expected to be provided to this class
get(...)method via the parameter
sectionURLs. In addition to this list of sections to scrape, the user should specify an instance of
URLFilterthat tells the scraper-logic which URL's to ignore. For most of the news-sites that have been tested with this package all non-advertising and "related article URL's" have a very specific pattern that can be identified with a regular expression. There is even an instance of
class LinksGetif more work needs to be done when retrieving and identifying which URL's are relevant.
Perhaps the user may wonder what work this class is actually doing, if it is necessary to provide instances of URLFilter and a Vector 'sectionURLs' - ... and the answer is: not a lot! This class is actually very short; it just ensures that as much error checking as possible is done, that the returned vector has been checked for valid URL's, and that all nulls have been eliminated!
Here is an example "URL Retrieve" operation on the Mandarin-Chinese-language government web-portal available in North America. Translating these pages for study of the politics and technology from the other side of the Pacific Ocean was the primary impetus for developing the Java-HTML JAR Library.
// Sample Article URL from the Chinese National Web-Portal - all valid articles have the basic pattern:
// http://www.gov.cn/xinwen/2020-07/17/content_5527889.htm
//
// This "Regular Expression" will match any News Article URL that "looks like" the above URL.
String articleURLRegExStr = "http://www.gov.cn/xinwen/\\d\\d\\d\\d-\\d\\d/\\d\\d/content_\\d+.html?";
Pattern articleURLsRegEx = Pattern.compile(articleURLRegExStr);

Vector<URL> sectionURLs = new Vector<>();

// For the purposes of this example, only one section of the 'www.Gov.CN/' web-portal will be
// visited. There are other "Newspaper SubSections" that could easily be added to this Vector.
// If more sections were added, more news-article URL's would likely be found, identified and
// returned.
sectionURLs.add(new URL("https://www.gov.cn/"));

// The factory class StrFilter may look complicated, but its methods are simple; they just
// encompass a lot of error-checking, which may look verbose or complex.
URLFilter filter = URLFilter.fromStrFilter(StrFilter.regExKEEP(articleURLsRegEx, true));

// The 'log' parameter accepts any java.lang.Appendable - here, a StringWriter
StringWriter sw = new StringWriter();

Vector<Vector<String>> articleURLs = ScrapeURLs.get(sectionURLs, filter, null, sw);

// This will write every article URL to a text file called "urls.txt"
FileRW.writeFile(articleURLs.elementAt(0), "urls.txt");

// This will write the article-URL's vector to a serialized-object data-file called "urls.vdat"
FileRW.writeObjectToFile(articleURLs, "urls.vdat", true);
The 'urls.vdat' file that was created can easily be retrieved using Java's de-serialization streams. If the cast (below) were necessary, then an annotation of the format @SuppressWarnings("unchecked") would be required.
Java Line of Code:
Vector<Vector<String>> urls = (Vector<Vector<String>>) FileRW.readObjectFromFile("urls.vdat", Vector.class, true);
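Putting it together, here is a short read-back sketch. This is illustrative only; the loop and the variable names are not part of the library:

// Read the serialized Vector back from disk. The cast from the raw Vector is what
// makes the @SuppressWarnings("unchecked") annotation necessary.
@SuppressWarnings("unchecked")
Vector<Vector<String>> articleURLs = (Vector<Vector<String>>)
    FileRW.readObjectFromFile("urls.vdat", Vector.class, true);

// Print how many article-URL's were retrieved for each news-section
for (int i = 0; i < articleURLs.size(); i++)
    System.out.println("Section " + i + ": " + articleURLs.elementAt(i).size() + " URL's");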
Stateless Class: This class neither contains any program-state, nor can it be instantiated; it carries the @StaticFunctional annotation. Static-Functional classes are, essentially, C-styled files, without any constructors or non-static member fields.
- 1 Constructor(s), 1 declared private, zero-argument constructor
- 3 Method(s), 3 declared static
- 1 Field(s), 1 declared static, 0 declared final
- Fields excused from final modifier (with explanation):
Field 'SKIP_ON_SECTION_URL_EXCEPTION' is not final. Reason: CONFIGURATION
Fields:
- static boolean SKIP_ON_SECTION_URL_EXCEPTION

Methods (Retrieve Article URL's using a LinksGet and Optional Filter):
- static Vector<Vector<String>> get(Vector<URL> sectionURLs, URLFilter articleURLFilter, LinksGet linksGetter, Appendable log)
- static Vector<Vector<String>> get(NewsSite ns, Appendable log)
public static boolean SKIP_ON_SECTION_URL_EXCEPTION

This is a static boolean configuration field. When this is set to TRUE, if one of the "Section URL's" provided to this class is not valid, and generates a 404 FileNotFoundException, or some other HttpConnection exception, those exceptions will simply be logged, and quietly ignored. When this flag is set to FALSE, any problems that occur when attempting to pick out News-Article URL's from a Section Web-Page will cause a SectionURLException to throw, and the ScrapeURLs process will halt.
SIMPLY PUT: There are occasions when a news web-site will remove a section such as "Commerce", "Sports", or "Travel" - and when or if one of these suddenly goes missing, it is better to just skip that section rather than halting the scrape; to get this behavior, keep this flag set to TRUE.
ALSO: This is, indeed, a static flag (field), which does mean that all processes using class ScrapeURLs must share the same setting (simultaneously). This particular flag CANNOT be changed in a thread-safe manner.
- Exact Field Declaration Expression:
public static boolean SKIP_ON_SECTION_URL_EXCEPTION = true;
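Below is a short, hypothetical sketch of using this flag; the variables 'sectionURLs' and 'filter' are presumed to have been built as in the earlier example. Remember that, because the field is static, the assignment affects every caller of ScrapeURLs in the JVM:

// Demand strict behavior: a missing news-section should halt the scrape
ScrapeURLs.SKIP_ON_SECTION_URL_EXCEPTION = false;

try
{
    Vector<Vector<String>> articleURLs =
        ScrapeURLs.get(sectionURLs, filter, null, System.out);
}
catch (SectionURLException e)
{
    // With the flag set to FALSE, a faulty or removed Section-URL lands here,
    // and the scrape halts instead of skipping that section.
    System.err.println("A news-section was unavailable: " + e.getMessage());
}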
public static java.util.Vector<java.util.Vector<java.lang.String>> get(NewsSite ns, java.lang.Appendable log) throws java.io.IOException

Convenience Method. Invokes get(Vector, URLFilter, LinksGet, Appendable).
- Exact Method Body:
return get(ns.sectionURLsVec(), ns.filter, ns.linksGetter, log);
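In other words, given an already-built NewsSite instance - called 'ns' here, purely for illustration - the two invocations below are equivalent:

// Convenience form
Vector<Vector<String>> articleURLs = ScrapeURLs.get(ns, System.out);

// Expanded form - exactly what the convenience method does internally
Vector<Vector<String>> sameResult =
    ScrapeURLs.get(ns.sectionURLsVec(), ns.filter, ns.linksGetter, System.out);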
public static java.util.Vector<java.util.Vector<java.lang.String>> get(java.util.Vector<java.net.URL> sectionURLs, URLFilter articleURLFilter, LinksGet linksGetter, java.lang.Appendable log)

This method is used to retrieve all of the available article URL links found on all sections of a newspaper website.
sectionURLs - This should be a vector of URL's that contains all of the "Main News-Paper Page Sections." Typical newspaper sections are things like: Life, Sports, Business, World, Economy, Arts, etc... This parameter may not be null, or a NullPointerException will throw.
articleURLFilter - If there is a standard pattern for a URL that must be avoided, then this filter parameter should be used. This parameter may be null, and if it is, it shall be ignored. This Java URL-Predicate (an instance of Predicate<URL>) should return TRUE if a particular URL needs to be kept, not filtered. When this Predicate evaluates to FALSE, the URL will be filtered. NOTE: This behavior is identical to that of the Java Stream method filter(Predicate). URL's that are filtered will neither be scraped, nor saved, into the newspaper-article result-set output file.
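As an illustrative sketch only - and assuming that URLFilter may be written as a lambda-expression, since it is described above as a URL-Predicate - a filter that keeps only the article-pattern from the earlier 'www.Gov.CN' example might look like:

// Keep (return TRUE for) URL's matching the article pattern; filter everything else
URLFilter articleURLFilter = (URL url) ->
    url.toString().matches("http://www.gov.cn/xinwen/\\d{4}-\\d{2}/\\d{2}/content_\\d+.html?");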
linksGetter - This instance may be used to retrieve all links on a particular section-page. This parameter may be null; if it is, it will be ignored, and all HTML anchor (<A HREF=...>) links will be considered "Newspaper Articles to be scraped." Be careful about ignoring this parameter, because there may be many extraneous, non-news-article links on a particular Internet news web-site or inside a web-page section.
log - This prints log information to the screen. This parameter may not be null, or a NullPointerException will throw. This parameter expects an implementation of Java's interface java.lang.Appendable, which allows for a wide range of options when logging intermediate messages.
Class or Interface Instance - Use & Purpose:
- System.out: Sends text to the standard-out terminal.
- StorageWriter: Sends text to System.out, and saves it, internally.
- FileWriter, PrintWriter, StringWriter: General-purpose Java text-output classes.
- Other, more general-purpose, Java text-output classes may be used as well.
NOTE: The interface Appendable requires that the checked exception IOException must be caught when using its append(...) methods.
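Since any Appendable will do, the choice of log-target is flexible. The snippet below is illustrative only; both PrintStream (the type of System.out) and StringWriter are standard JDK classes that implement Appendable:

// Log straight to the terminal - PrintStream implements java.lang.Appendable
Vector<Vector<String>> v1 = ScrapeURLs.get(sectionURLs, filter, null, System.out);

// Or capture the log in memory for later inspection - StringWriter also implements Appendable
StringWriter sw = new StringWriter();
Vector<Vector<String>> v2 = ScrapeURLs.get(sectionURLs, filter, null, sw);
String logText = sw.toString();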
- The "
Vector's" that is returned is simply a list of all newspaper anchor-link
URL'sfound on each Newspaper Sub-Section
URLpassed to the
'sectionURLs'parameter. The returned "
Vector's" is parallel to the input-parameter
What this means is that the Newspaper-Article
URL-Links scraped from the page located at
sectionURLs.elementAt(0)- will be stored in the return-
URL'sscraped off of page
sectionURLs.elementAt(1)will be stored in the return-
ret.elementAt(1). And so on, and so forth...
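A small sketch of walking the two parallel Vector's together (the variable names here are illustrative):

Vector<Vector<String>> ret = ScrapeURLs.get(sectionURLs, filter, null, System.out);

// 'ret' is parallel to 'sectionURLs': element i of 'ret' holds the article-URL's
// that were scraped from the section-page at sectionURLs.elementAt(i)
for (int i = 0; i < sectionURLs.size(); i++)
{
    System.out.println("Section: " + sectionURLs.elementAt(i));
    for (String articleURL : ret.elementAt(i))
        System.out.println("    " + articleURL);
}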
- Throws: SectionURLException - If one of the provided sectionURL's (Life, Sports, Travel, etc...) is not valid, or not available on the page, then this exception will throw. Note, however, that there is a flag (SKIP_ON_SECTION_URL_EXCEPTION) that will force this method to simply "skip" a faulty or non-available Section URL, and move on to the next news-article section. By default, this flag is set to TRUE, meaning that this method will skip newspaper sections that have been temporarily removed, rather than causing the method to exit. This default behavior can be changed by setting the flag to FALSE.
- Exact Method Body: