Torello.HTML.Tools.NewsSite

Utilities for scraping news web-sites. Scraping is performed in two steps. The first retrieves article URLs from the main-page and sub-sections of the newspaper site. The second retrieves the articles themselves. Unless a specialized ScrapedArticleReceiver is provided, the articles are saved to disk, encoded using Java's Serializable routines. A method is provided for converting these data-files to '.html' files, and for retrieving / 'localizing' the images encountered on the article pages.
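
The save-then-convert flow described above rests on standard Java object serialization. The following is a minimal sketch of that round trip, not the package's own code: `SketchArticle`, `SerializationSketch` and their methods are hypothetical names invented for illustration, the HTML is modeled as a plain `Vector<String>`, and an in-memory byte array stands in for the data-file on disk.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.Vector;

// Hypothetical stand-in for a scraped article; NOT the package's Article class.
class SketchArticle implements Serializable {
    private static final long serialVersionUID = 1L;

    final String url;
    final Vector<String> htmlTokens;

    SketchArticle(String url, Vector<String> htmlTokens) {
        this.url = url;
        this.htmlTokens = htmlTokens;
    }
}

class SerializationSketch {
    // Encode the article with Java serialization (a byte array stands in for a file).
    static byte[] toBytes(SketchArticle article) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(article);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Decode a previously serialized article.
    static SketchArticle fromBytes(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (SketchArticle) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("could not read serialized article", e);
        }
    }

    // Convert the stored token vector back into one '.html' string.
    static String toHtml(SketchArticle article) {
        return String.join("", article.htmlTokens);
    }

    public static void main(String[] args) {
        Vector<String> tokens = new Vector<>(List.of("<p>", "Headline", "</p>"));
        SketchArticle saved = new SketchArticle("https://example.com/story", tokens);
        SketchArticle loaded = fromBytes(toBytes(saved));
        System.out.println(loaded.url + " -> " + toHtml(loaded));
    }
}
```

In a real run the byte array would be replaced by a file stream, and the conversion step would be the job of the `ToHTML` class listed below.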

News-Scrape Main A.P.I.

Java Entity	Description
ScrapeURLs	Collects all news-article `URL's` from a news-oriented web-site's main web-page and from its listed 'sub-section' web-pages
ScrapeArticles	This class runs the primary iteration-loop for downloading news-articles using a list of article-`URL's`
ToHTML	Converts Serialized Object Files of HTML-Vectors into `'.html'` Files, and can also be used to do any user-defined, customized post-processing (using a function-pointer) on news-articles (after downloading them)

News Data Classes
Java Entity	Description
Article	When a news article is downloaded from a `URL`, its contents are parsed, and the information-HTML is stored in this class
NewsSite	The 'data flow' encapsulation class that contains most of the salient features of a news oriented web-site
NewsSites	This class is nothing more than an 'Example Class' that contains some foreign-language based news web-pages, from both overseas and from Latin America

Function-Pointer / Lambda-Targets
Java Entity	Description
ArticleGet	A function-pointer / lambda-target for extracting an article's content from the web-page from which it was downloaded; includes several `static`-builder methods for the most common means of finding the HTML-Tags that wrap article-HTML on news-media websites
HTMLModifier	A simple Java function-pointer / lambda-target that may be used to modify or alter Vectorized-HTML, in any way that the programmer has deemed necessary
LinksGet	A function-pointer / lambda-target interface that facilitates extracting news-article `URL's` from the main-page (or a sub-section) of a news-media web-site
Pause	While the main iteration-loop for downloading news-articles is running, the loop-variables are kept current in this class; if the programmer interrupts the downloader (by pressing `Control-C`), 'download progress' isn't lost, and articles that have already been saved need not be downloaded again
ScrapedArticleReceiver	A Java function-pointer / lambda-target that provides a means for deciding where to save downloaded article HTML, including a `static`-builder method for choosing to save articles directly to the file-system
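
The "lambda-target with a `static`-builder" pattern named in this table can be sketched as follows. This is an illustration under stated assumptions rather than the actual `ArticleGet` API: `SketchArticleGet` and its `between` builder are hypothetical, and a page is modeled as a simple `List<String>` of tokens instead of the library's vectorized HTML.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical functional interface; the real ArticleGet signature may differ.
@FunctionalInterface
interface SketchArticleGet {
    // Receives the tokens of a whole page, returns only the article tokens.
    List<String> apply(List<String> page);

    // Static builder: returns a lambda that keeps everything from the first
    // occurrence of openTag through the next occurrence of closeTag, inclusive.
    // An empty list is returned when either tag is missing.
    static SketchArticleGet between(String openTag, String closeTag) {
        return page -> {
            int start = page.indexOf(openTag);
            if (start < 0) return new ArrayList<>();

            int end = -1;
            for (int i = start + 1; i < page.size(); i++) {
                if (page.get(i).equals(closeTag)) {
                    end = i;
                    break;
                }
            }
            if (end < 0) return new ArrayList<>();

            return new ArrayList<>(page.subList(start, end + 1));
        };
    }
}

class ArticleGetSketch {
    public static void main(String[] args) {
        List<String> page = List.of(
            "<html>", "<nav>", "menu", "</nav>",
            "<article>", "Story text", "</article>", "</html>");

        // The builder hides the search logic; callers only name the wrapping tags.
        SketchArticleGet getter = SketchArticleGet.between("<article>", "</article>");
        System.out.println(getter.apply(page));
    }
}
```

Because the interface has a single abstract method, a caller could equally pass a hand-written lambda when none of the prebuilt strategies fits a particular site.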

Enum
Java Entity	Description
DownloadResult	An enumeration of the various problems that could potentially arise when downloading news-article HTML

Exception
Java Entity	Description
ArticleGetException	An exception that is thrown from inside the `lambda-methods` created by the methods of class `ArticleGet`, if an error occurs while retrieving a news-article body from inside a news web-page
NewsSiteException	This exception is thrown by the `NewsSite` class' constructor if improper input-parameter data is provided to that constructor
PauseException	Thrown by implementations of the `'Pause'` interface: if any method inside an implementation of interface `'Pause'` needs to throw an exception, that exception must be wrapped by this (unchecked, runtime) exception
ReceiveException	Thrown by implementations of the `'ScrapedArticleReceiver'` interface: if any method inside an implementation of interface `'ScrapedArticleReceiver'` needs to throw an exception, that exception must be wrapped by this (unchecked, runtime) exception
SectionURLException	This exception is thrown if there is an error while scraping a news-site for `URL's`
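
The wrapping rule stated for `PauseException` and `ReceiveException` is the usual way of letting a lambda-target, whose method declares no checked exceptions, report a checked failure. A minimal sketch of the idea follows; `SketchReceiveException`, `SketchReceiver` and `WrapSketch` are hypothetical stand-ins, and the real classes' constructors and method signatures are not shown in this summary and may differ.

```java
import java.io.IOException;

// Hypothetical unchecked wrapper; stands in for ReceiveException / PauseException.
class SketchReceiveException extends RuntimeException {
    SketchReceiveException(String message, Throwable cause) {
        super(message, cause);
    }
}

// Hypothetical lambda-target; note that its method declares no checked exceptions.
@FunctionalInterface
interface SketchReceiver {
    void receive(String articleHtml);
}

class WrapSketch {
    // Simulated storage call that can fail with a checked exception.
    static void store(String html) throws IOException {
        if (html.isEmpty()) throw new IOException("nothing to write");
    }

    // The implementation catches the checked exception and rethrows it
    // wrapped in the unchecked type, preserving the original as the cause.
    static SketchReceiver storingReceiver() {
        return html -> {
            try {
                store(html);
            } catch (IOException e) {
                throw new SketchReceiveException("could not save article", e);
            }
        };
    }

    public static void main(String[] args) {
        SketchReceiver receiver = storingReceiver();
        receiver.receive("<p>saved</p>");
        try {
            receiver.receive("");
        } catch (SketchReceiveException e) {
            System.out.println("wrapped cause: " + e.getCause().getClass().getSimpleName());
        }
    }
}
```

Keeping the original exception as the cause means the calling loop can still inspect what actually went wrong.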
