Package Torello.HTML.Tools.NewsSite
Class Article
- java.lang.Object
-
- Torello.HTML.Tools.NewsSite.Article
-
- All Implemented Interfaces:
java.io.Serializable
public class Article extends java.lang.Object implements java.io.Serializable
When a news article is downloaded from a URL, its contents are parsed, and the resulting HTML is stored in this class.
This class stores the results of downloading and scraping a news article from a news site. Instances of this class are produced by calls to the class ScrapeArticles. The results can be saved to a Vector, or stored to the file system for later use. Internally, an instance can contain both the original news-site article web page and the pared-down article-body web page.
- See Also:
- Serialized Form
Source code: Torello/HTML/Tools/NewsSite/Article.java (file size: 4,750 bytes; line count: 124)
-
-
Field Summary
Serializable ID
Modifier and Type        Field
protected static long    serialVersionUID

Primary Article Data
Modifier and Type        Field
Vector<HTMLNode>         articleBody
Vector<HTMLNode>         originalPage
String                   titleElement
URL                      url
boolean                  wasErrorDownload

Article Image Data
Modifier and Type        Field
int[]                    imagePosArr
Vector<URL>              imageURLs

Torello.HTML.PageStats
Modifier and Type        Field
PageStats                originalPageStats
PageStats                processedArticleStats
-
-
-
Field Detail
-
serialVersionUID
protected static final long serialVersionUID
This provides the serialVersionUID expected of classes that implement Java's interface java.io.Serializable. Using the Serializable implementation offered by Java is very easy, and can make saving program state when debugging a lot easier. It can also be used in place of more complicated systems like "hibernate" to store data.
- See Also:
- Constant Field Values
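As a rough illustration of the save-and-restore workflow described above, the following sketch round-trips a Serializable object through standard java.io object streams. The class SavedArticle is a hypothetical stand-in for Article (it is not part of the library), and the helper names toBytes and fromBytes are assumptions made for this example only. Writing to a FileOutputStream instead of a byte array would store the object on the file system.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class SerializationSketch {

    // Hypothetical stand-in for Article: a Serializable class with final fields.
    static class SavedArticle implements Serializable {
        protected static final long serialVersionUID = 1L;

        final String url;
        final String titleElement;

        SavedArticle(String url, String titleElement) {
            this.url = url;
            this.titleElement = titleElement;
        }
    }

    // Serializes any Serializable object into a byte array.
    static byte[] toBytes(Serializable obj) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Restores an object previously produced by toBytes.
    static Object fromBytes(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SavedArticle original = new SavedArticle("https://news.example.com/story.html", "Example Story");
        SavedArticle restored = (SavedArticle) fromBytes(toBytes(original));
        System.out.println(restored.titleElement);
    }
}
```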
-
wasErrorDownload
public final boolean wasErrorDownload
This informs the user that an error occurred when downloading an article. If this field is TRUE after instantiation, all other fields in this class should be treated as irrelevant.
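One way to honor this flag is to check it before reading any other field. The sketch below uses a hypothetical stand-in class named Result with only two fields, since how the library constructs a failed Article is not shown here. The method name usableTitles is likewise an assumption for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class ErrorFlagSketch {

    // Hypothetical stand-in for Article; only the two fields needed here.
    static final class Result {
        final boolean wasErrorDownload;
        final String titleElement; // meaningless when wasErrorDownload is true

        Result(boolean wasErrorDownload, String titleElement) {
            this.wasErrorDownload = wasErrorDownload;
            this.titleElement = titleElement;
        }
    }

    // Collects titles only from results whose download succeeded.
    static List<String> usableTitles(List<Result> results) {
        List<String> titles = new ArrayList<>();
        for (Result r : results) {
            if (r.wasErrorDownload) {
                continue; // skip: other fields should not be trusted
            }
            titles.add(r.titleElement);
        }
        return titles;
    }

    public static void main(String[] args) {
        List<Result> results = List.of(
            new Result(false, "Budget vote passes"),
            new Result(true, null),
            new Result(false, "Storm warning issued"));
        System.out.println(usableTitles(results));
    }
}
```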
-
url
public final java.net.URL url
This is the article's URL from the news website.
-
titleElement
public final java.lang.String titleElement
This is the title that was scraped from the main page. The title is the content of the <TITLE>...</TITLE> element on the article HTML page.
-
originalPage
public final java.util.Vector<HTMLNode> originalPage
This is the original, and complete, HTML vectorized-page download. It contains the original, un-modified, article download.
-
articleBody
public final java.util.Vector<HTMLNode> articleBody
This is the pared-down article body. It is what is retrieved from class ArticleGet.
-
imageURLs
public final java.util.Vector<java.net.URL> imageURLs
The image URLs that were found in the news article. The easiest way to think about this field is that the following instructions were called on the article body after downloading the article:
Vector<TagNode> imageNodes = TagNodeGet.all(article, TC.OpeningTags, "img");
Vector<URL> imageURLs = Links.resolveSRCs(imageNodes, articleURL);
// The results of the above call are stored in this field / Vector<URL>.
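To illustrate what resolving SRC attributes against the article address involves, here is a minimal sketch that uses only the standard java.net.URI class. It is not the implementation of Links.resolveSRCs. The class name, the method resolveAll, and the example addresses are assumptions, and the sketch returns URI values where the field itself holds URL values.

```java
import java.net.URI;
import java.util.List;
import java.util.Vector;

class ImageUrlSketch {

    // Resolves each raw SRC attribute value against the article's own address.
    // Relative paths become absolute; already-absolute addresses are kept as-is.
    static Vector<URI> resolveAll(URI articleAddress, List<String> srcValues) {
        Vector<URI> resolved = new Vector<>();
        for (String src : srcValues) {
            resolved.add(articleAddress.resolve(src));
        }
        return resolved;
    }

    public static void main(String[] args) {
        URI article = URI.create("https://news.example.com/world/story.html");
        List<String> srcValues = List.of(
            "img/photo.jpg",                  // relative to the article's directory
            "/static/logo.png",               // relative to the site root
            "https://cdn.example.com/a.png"); // already absolute
        for (URI u : resolveAll(article, srcValues)) {
            System.out.println(u);
        }
    }
}
```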
-
imagePosArr
public final int[] imagePosArr
This array contains the "Image Positions" inside the vectorized article for each image that was found inside the article. The easiest way to think about this field is that the following instructions were called on the article body after downloading that article:
int[] imagePosArr = TagNodeFind.all(page, TC.OpeningTags, "img");
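The idea of an index array can be sketched without the library by treating each node of a vectorized page as a plain String. This is a simplification: the real page is a Vector<HTMLNode>, and the class name, the method findImagePositions, and the regular expression below are assumptions made for illustration, not the behavior of TagNodeFind.all.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.IntStream;

class ImagePositionsSketch {

    // Matches an opening <img> tag, with or without attributes, in any letter case.
    private static final Pattern IMG_TAG =
        Pattern.compile("<img(\\s.*|/?>)", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the indices of the nodes that are opening <img> tags.
    static int[] findImagePositions(List<String> nodes) {
        return IntStream.range(0, nodes.size())
            .filter(i -> IMG_TAG.matcher(nodes.get(i)).matches())
            .toArray();
    }

    public static void main(String[] args) {
        List<String> page = List.of(
            "<p>", "Hello", "<img src=\"a.png\">", "</p>", "<IMG SRC='b.png'>");
        System.out.println(Arrays.toString(findImagePositions(page))); // prints [2, 4]
    }
}
```

Each returned index can then be used to look up the corresponding node directly in the vector.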
-
originalPageStats
public final PageStats originalPageStats
This contains an instance of class PageStats that has been generated from the original Newspaper Article Page.
Java Line of Code:
this.originalPageStats = new PageStats(originalPage);
-
processedArticleStats
public final PageStats processedArticleStats
This contains an instance of class PageStats that has been generated from the post-processed Newspaper Article.
Java Line of Code:
this.processedArticleStats = new PageStats(articleBody);
-
-
Constructor Detail
-
Article
public Article(java.net.URL url, java.lang.String titleElement, java.util.Vector<HTMLNode> originalPage, java.util.Vector<HTMLNode> articleBody, java.util.Vector<java.net.URL> imageURLs, int[] imagePosArr)
Builds an instance of this class.
- Parameters:
url - The web address from which this news article was downloaded / retrieved.
titleElement - The contents of the HTML <TITLE> tag, as a String.
originalPage - Vectorized-HTML of the original article web page, in its entirety.
articleBody - Vectorized-HTML of the body of the article's page, as extracted by the ArticleGet function-pointer.
imageURLs - A list of the URLs of all HTML <IMG> elements found inside the 'articleBody'.
imagePosArr - The Vector indices where the images (if any) were found in the article.
-
-