Package Torello.HTML.Tools.NewsSite
Utilities for scraping news web-sites. Scraping is performed in two steps. The first retrieves the Article URLs from the main-page and sub-sections of the newspaper site. The second retrieves the Articles themselves. The articles are saved to disk, unless a specialized ScrapedArticleReceiver is provided, and they are encoded using Java's Serializable routines. A method is provided for converting these data-files to '.html' files, and for retrieving / 'localizing' the images encountered on the Article-pages.
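
The save-to-disk and convert-to-'.html' steps described above can be illustrated with a minimal sketch. It does not use this package's classes: ArticleFileSketch and all of its method names are hypothetical, and a Vector<String> stands in for whatever HTML-Vector type the package actually serializes. Only standard java.io and java.nio calls are used.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Vector;

// Hypothetical helper for illustration only; NOT part of Torello.HTML.Tools.NewsSite.
final class ArticleFileSketch
{
    // Writes the article's HTML fragments to a data-file using Java serialization.
    static void save(Vector<String> articleHtml, Path dataFile)
    {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(dataFile)))
        {
            out.writeObject(articleHtml);
        }
        catch (IOException e) { throw new UncheckedIOException(e); }
    }

    // Reads the fragments back from a data-file previously written by save(...).
    @SuppressWarnings("unchecked")
    static Vector<String> load(Path dataFile)
    {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(dataFile)))
        {
            return (Vector<String>) in.readObject();
        }
        catch (IOException e) { throw new UncheckedIOException(e); }
        catch (ClassNotFoundException e) { throw new IllegalStateException(e); }
    }

    // Concatenates the fragments and writes them out as a UTF-8 '.html' file.
    static void toHtmlFile(Vector<String> articleHtml, Path htmlFile)
    {
        try
        {
            Files.write(htmlFile, String.join("", articleHtml).getBytes(StandardCharsets.UTF_8));
        }
        catch (IOException e) { throw new UncheckedIOException(e); }
    }

    // Saves, reloads and converts one article via temporary files,
    // then returns the text of the resulting '.html' file.
    static String roundTripAsHtml(Vector<String> articleHtml)
    {
        try
        {
            Path dataFile = Files.createTempFile("article", ".dat");
            Path htmlFile = Files.createTempFile("article", ".html");
            try
            {
                save(articleHtml, dataFile);
                toHtmlFile(load(dataFile), htmlFile);
                return new String(Files.readAllBytes(htmlFile), StandardCharsets.UTF_8);
            }
            finally
            {
                Files.deleteIfExists(dataFile);
                Files.deleteIfExists(htmlFile);
            }
        }
        catch (IOException e) { throw new UncheckedIOException(e); }
    }

    public static void main(String[] args)
    {
        Vector<String> article = new Vector<>(Arrays.asList("<p>", "Hello", "</p>"));
        System.out.println(roundTripAsHtml(article));
    }
}
```

Keeping the serialized data-file separate from the '.html' output means the conversion can be re-run later without downloading the article again, which matches the two-stage workflow the description outlines.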
News-Scrape Main A.P.I.
Java Entity: Description
ScrapeURLs: Collects all news-article URLs from a news-oriented web-site's main web-page and from its list of 'sub-section' web-pages.
ScrapeArticles: Runs the primary iteration-loop for downloading news-articles using a list of article URLs.
ToHTML: Converts Serialized Object Files of HTML-Vectors into '.html' files, and can also be used to do any user-defined, customized post-processing (using a function-pointer) on news-articles after downloading them.

News Data Classes
Java Entity: Description
Article: When a news article is downloaded from a URL, its contents are parsed, and the information-HTML is stored in this class.
NewsSite: The 'data flow' encapsulation class that contains most of the salient features of a news-oriented web-site.
NewsSites: An 'Example Class' that contains some foreign-language news web-pages, from both overseas and Latin America.

Function-Pointer / Lambda-Targets
Java Entity: Description
ArticleGet: A function-pointer / lambda-target for extracting an article's content from the web-page from which it was downloaded; includes several static builder methods for the most common means of finding the HTML-Tags that wrap article-HTML on news-media web-sites.
HTMLModifier: A simple Java function-pointer / lambda-target that may be used to modify or alter Vectorized-HTML in any way the programmer deems necessary.
LinksGet: A function-pointer / lambda-target interface that facilitates extracting news-article URLs from the main-page (or a sub-section) of a news-media web-site.
Pause: While the main iteration-loop for downloading news-articles is running, the loop-variables are kept current in this class. If the programmer interrupts the downloader (by pressing Control-C), download progress is not lost, and articles that have already been saved need not be downloaded again.
ScrapedArticleReceiver: A Java function-pointer / lambda-target that provides a means for deciding where to save downloaded article HTML; includes a static builder method for choosing to save articles directly to the file-system.

Enum
Java Entity: Description
DownloadResult: An enumeration of the various problems that could arise when downloading news article HTML.

Exception
Java Entity: Description
ArticleGetException: Thrown from inside the lambda-methods created by the methods of class ArticleGet, if an error occurs while retrieving a news article body from inside a news web-page.
NewsSiteException: Thrown by the NewsSite class' constructor if improper input-parameter data is provided to that constructor.
PauseException: Thrown by the 'Pause' interface. If any method inside an implementation of interface 'Pause' needs to throw an exception, that exception must be wrapped by this (unchecked, runtime) exception.
ReceiveException: Thrown by the 'ScrapedArticleReceiver' interface. If any method inside an implementation of interface 'ScrapedArticleReceiver' needs to throw an exception, that exception must be wrapped by this (unchecked, runtime) exception.
SectionURLException: Thrown if there is an error while scraping a news-site for URLs.
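
The wrapping rule stated for PauseException and ReceiveException (a checked exception raised inside a lambda-target must be re-thrown inside an unchecked, runtime exception) can be sketched as follows. This is an illustration under stated assumptions, not the package's API: the Receiver interface, WrappedReceiveException, store and deliverAll are all hypothetical stand-ins, and the real interfaces may have different method names and parameters.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these types belong to Torello.HTML.Tools.NewsSite.
final class ReceiverWrapSketch
{
    // Stand-in for a lambda-target whose method declares no checked exceptions.
    @FunctionalInterface
    interface Receiver
    {
        void receive(String articleHtml, int index);
    }

    // Stand-in for an unchecked wrapper exception in the style of ReceiveException.
    static final class WrappedReceiveException extends RuntimeException
    {
        WrappedReceiveException(String message, Throwable cause) { super(message, cause); }
    }

    // Pretend storage step that declares a checked exception; rejects empty articles.
    static void store(String articleHtml) throws IOException
    {
        if (articleHtml.isEmpty()) throw new IOException("nothing to store");
    }

    // The lambda cannot throw IOException directly, so it wraps the
    // checked exception in the unchecked one and keeps the original as the cause.
    static final Receiver RECEIVER = (articleHtml, index) ->
    {
        try
        {
            store(articleHtml);
        }
        catch (IOException e)
        {
            throw new WrappedReceiveException("article #" + index + " could not be stored", e);
        }
    };

    // Delivers every article to the receiver and returns the indices that failed.
    static List<Integer> deliverAll(List<String> articles)
    {
        List<Integer> failed = new ArrayList<>();
        for (int i = 0; i < articles.size(); i++)
        {
            try
            {
                RECEIVER.receive(articles.get(i), i);
            }
            catch (WrappedReceiveException e)
            {
                failed.add(i); // e.getCause() still holds the original IOException
            }
        }
        return failed;
    }

    public static void main(String[] args)
    {
        System.out.println(deliverAll(Arrays.asList("<p>a</p>", "", "<p>c</p>")));
    }
}
```

Because the wrapper is unchecked, the functional interface stays usable as a plain lambda, while the calling loop can still catch one well-known exception type and inspect its cause.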