Package Torello.HTML.Tools.NewsSite
Class Article
- java.lang.Object
-
- Torello.HTML.Tools.NewsSite.Article
-
- All Implemented Interfaces:
java.io.Serializable
public class Article extends java.lang.Object implements java.io.Serializable
When a news article is downloaded from a URL, its contents are parsed, and the HTML carrying the article's content is stored in this class.
This class stores the results of downloading and scraping a news article from a news site. Instances of this class are produced by calls to class ScrapeArticles. These results can be saved to a Vector, or stored to the File-System for later use. Internally, they can contain both the original News-Site Article Web-Page and the pared-down Article-Body Web-Page.
See Also:
- Serialized Form
Hi-Lited Source-Code: Torello/HTML/Tools/NewsSite/Article.java
File Size: 4,726 Bytes Line Count: 123 '\n' Characters Found
-
-
Field Summary

Serializable ID
  Modifier and Type        Field
  protected static long    serialVersionUID

Primary Article Data
  Modifier and Type        Field
  Vector<HTMLNode>         articleBody
  Vector<HTMLNode>         originalPage
  String                   titleElement
  URL                      url
  boolean                  wasErrorDownload

Article Image Data
  Modifier and Type        Field
  int[]                    imagePosArr
  Vector<URL>              imageURLs

Torello.HTML.PageStats
  Modifier and Type        Field
  PageStats                originalPageStats
  PageStats                processedArticleStats
-
-
-
Field Detail
-
serialVersionUID
protected static final long serialVersionUID
This fulfils the serialVersionUID requirement for all classes that implement Java's interface java.io.Serializable. The Serializable implementation offered by Java is very easy to use, and can make saving program state while debugging much easier. It can also be used in place of more complicated systems like "hibernate" to store data.
See Also:
- Constant Field Values
- Code:
- Exact Field Declaration Expression:
protected static final long serialVersionUID = 1;
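As a rough illustration of the save-and-restore use described above, the following sketch round-trips a small object through standard Java serialization. The class SavedArticle is a hypothetical stand-in with two String fields, not the real Article class, and the helper names toBytes and fromBytes are assumptions made for this example only. It writes to an in-memory byte array; writing to a file would use a FileOutputStream in the same position.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class SerializationSketch {

    // Hypothetical stand-in for Article (NOT the library class): two fields only.
    static class SavedArticle implements Serializable {
        protected static final long serialVersionUID = 1;

        final String url;
        final String titleElement;

        SavedArticle(String url, String titleElement) {
            this.url = url;
            this.titleElement = titleElement;
        }
    }

    // Serializes any Serializable object into a byte array.
    static byte[] toBytes(Serializable obj) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Rebuilds a SavedArticle from bytes produced by toBytes.
    static SavedArticle fromBytes(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (SavedArticle) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SavedArticle original = new SavedArticle("https://news.example.com/story.html", "Example Story");
        SavedArticle restored = fromBytes(toBytes(original));
        System.out.println(restored.url + " | " + restored.titleElement);
    }
}
```

The restored object is a distinct instance whose fields are equal to the original's, which is what makes this approach convenient for saving program state between debugging sessions.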
-
wasErrorDownload
public final boolean wasErrorDownload
This informs the user that an error occurred when downloading an article. If this field is TRUE after instantiation, all other fields in this class should be thought of as "irrelevant."
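One way to act on this flag is to check it before reading any other field. The sketch below assumes a hypothetical stand-in class Result that carries only a url string and the wasErrorDownload flag; it is not the real Article class, and the method name successfulOnly is an assumption for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ErrorFlagSketch {

    // Hypothetical stand-in for Article (NOT the library class).
    static class Result {
        final String url;
        final boolean wasErrorDownload;

        Result(String url, boolean wasErrorDownload) {
            this.url = url;
            this.wasErrorDownload = wasErrorDownload;
        }
    }

    // Keeps only the results whose download did not report an error.
    static List<Result> successfulOnly(List<Result> results) {
        List<Result> ok = new ArrayList<>();
        for (Result r : results) {
            if (!r.wasErrorDownload) {
                ok.add(r);
            }
        }
        return ok;
    }

    public static void main(String[] args) {
        List<Result> results = Arrays.asList(
            new Result("https://news.example.com/good.html", false),
            new Result("https://news.example.com/failed.html", true));

        for (Result r : successfulOnly(results)) {
            System.out.println("usable: " + r.url);
        }
    }
}
```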
-
url
public final java.net.URL url
This is the article's URL from the news website.
-
titleElement
public final java.lang.String titleElement
This is the title that was scraped from the main page. The title is the content of the <TITLE>...</TITLE> element on the article HTML page.
-
originalPage
public final java.util.Vector<HTMLNode> originalPage
This is the original, complete, vectorized HTML page, exactly as downloaded and without modification.
-
articleBody
public final java.util.Vector<HTMLNode> articleBody
This is the pared-down article-body. It is what is retrieved from class ArticleGet.
-
imageURLs
public final java.util.Vector<java.net.URL> imageURLs
The image URLs that were found in the news-article. The easiest way to think about this field is that the following instructions were called on the article-body after downloading the article:
Vector<TagNode> imageNodes = TagNodeGet.all(article, TC.OpeningTags, "img");
Vector<URL> imageURLs = Links.resolveSRCs(imageNodes, articleURL);
// The results of the above call are stored in this field / Vector<URL>.
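The general idea of resolving SRC attribute values against the article's address can be sketched with the standard library alone. This is not how Links.resolveSRCs is implemented; it is a minimal stand-in that uses java.net.URI rather than URL, and the class name, method name, and example addresses are all assumptions for illustration.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ResolveSrcSketch {

    // Resolves each raw SRC value (relative or absolute) against the article's address.
    static List<URI> resolveAll(URI articleUrl, List<String> srcValues) {
        List<URI> resolved = new ArrayList<>();
        for (String src : srcValues) {
            resolved.add(articleUrl.resolve(src));
        }
        return resolved;
    }

    public static void main(String[] args) {
        URI articleUrl = URI.create("https://news.example.com/world/story.html");
        List<String> srcValues = Arrays.asList(
            "photo.jpg",                        // relative to the article's directory
            "/static/logo.png",                 // relative to the site root
            "https://cdn.example.com/pic.gif"); // already absolute

        for (URI u : resolveAll(articleUrl, srcValues)) {
            System.out.println(u);
        }
    }
}
```

Relative values pick up the article's scheme, host, and directory, while values that are already absolute pass through unchanged.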
-
imagePosArr
public final int[] imagePosArr
This array contains the "Image Positions" inside the vectorized-article for each image that was found inside the article. The easiest way to think about this field is that the following instructions were called on the article-body after downloading that article:
int[] imagePosArr = TagNodeFind.all(page, TC.OpeningTags, "img");
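To make the notion of "positions inside the vectorized-article" concrete, the sketch below models a page as a plain list of strings, one per node, and collects the indices of those that look like opening IMG tags. The string-based node model and the names imagePositions and isOpeningImgTag are assumptions for this example; the real library works on HTMLNode instances and this is not its implementation of TagNodeFind.all.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.IntStream;

class ImagePositionsSketch {

    // True if the node text starts with "<img" followed by a non-alphanumeric character
    // (so "<img src=...>" matches, but a hypothetical "<image>" tag does not).
    static boolean isOpeningImgTag(String node) {
        String s = node.toLowerCase(Locale.ROOT);
        return s.startsWith("<img")
            && s.length() > 4
            && !Character.isLetterOrDigit(s.charAt(4));
    }

    // Returns, in ascending order, every index whose node is an opening IMG tag.
    static int[] imagePositions(List<String> nodes) {
        return IntStream.range(0, nodes.size())
            .filter(i -> isOpeningImgTag(nodes.get(i)))
            .toArray();
    }

    public static void main(String[] args) {
        List<String> page = Arrays.asList(
            "<p>", "Intro text", "<img src='a.jpg'>", "</p>", "<IMG SRC='b.png'>");

        System.out.println(Arrays.toString(imagePositions(page)));
    }
}
```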
-
originalPageStats
public final PageStats originalPageStats
This contains an instance of class PageStats that has been generated from the original Newspaper Article Page.
Java Line of Code:
this.originalPageStats = new PageStats(originalPage);
-
processedArticleStats
public final PageStats processedArticleStats
This contains an instance of class PageStats that has been generated from the post-processed Newspaper Article.
Java Line of Code:
this.processedArticleStats = new PageStats(articleBody);
-
-
Constructor Detail
-
Article
public Article(java.net.URL url, java.lang.String titleElement, java.util.Vector<HTMLNode> originalPage, java.util.Vector<HTMLNode> articleBody, java.util.Vector<java.net.URL> imageURLs, int[] imagePosArr)
Builds an instance of this class.
Parameters:
url - The web-address from which this news-article was downloaded / retrieved.
titleElement - The contents of the HTML <TITLE> tag, as a String.
originalPage - Vectorized-HTML of the original article web-page, in its entirety.
articleBody - Vectorized-HTML of the body of the article's page, as extracted by the ArticleGet function-pointer.
imageURLs - A list of the URLs of all HTML <IMG> elements found inside the 'articleBody'.
imagePosArr - The Vector-indices where the images (if any) were found in the article.
-
-