Package Torello.HTML.Tools.NewsSite
Class Article
java.lang.Object
    Torello.HTML.Tools.NewsSite.Article
All Implemented Interfaces:
java.io.Serializable
public class Article extends java.lang.Object implements java.io.Serializable
When a news article is downloaded from a URL, its contents are parsed, and the parsed HTML is stored in this class.
This class stores the results of downloading and scraping a news article from a news site. Instances of this class are produced by calls to class ScrapeArticles. These results can be saved to a vector or stored to the file system for later use. Internally, an instance can contain both the original news-site article web page and the pared-down article-body web page.
- See Also:
- Serialized Form
Hi-Lited Source-Code: Torello/HTML/Tools/NewsSite/Article.java
Field Summary

Serializable ID
Modifier and Type         Field
protected static long     serialVersionUID

Primary Article Data
Modifier and Type         Field
Vector<HTMLNode>          articleBody
Vector<HTMLNode>          originalPage
String                    titleElement
URL                       url
boolean                   wasErrorDownload

Article Image Data
Modifier and Type         Field
int[]                     imagePosArr
Vector<URL>               imageURLs

Torello.HTML.PageStats
Modifier and Type         Field
PageStats                 originalPageStats
PageStats                 processedArticleStats

Field Detail
-
serialVersionUID
protected static final long serialVersionUID
This fulfils the serialVersionUID requirement for all classes that implement Java's interface java.io.Serializable. Using the Serializable implementation offered by Java is very easy, and it can make saving program state while debugging much easier. It can also be used in place of more complicated systems such as Hibernate to store data.
- See Also:
- Constant Field Values
- Code:
- Exact Field Declaration Expression:
protected static final long serialVersionUID = 1;
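
As a rough illustration of the save-and-restore workflow described above, here is a minimal sketch that uses only standard java.io classes. The class ArticleStandIn is a hypothetical stand-in, not the library's Article class; it only mimics the shape (final fields and a serialVersionUID of 1) so that the example runs without the library on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Vector;

class SerializationSketch {
    // Writes any Serializable object into an in-memory byte array.
    static byte[] toBytes(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        return bytes.toByteArray();
    }

    // Reads back an object from the bytes produced by toBytes.
    static Object fromBytes(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Vector<String> body = new Vector<>();
        body.add("<p>");
        body.add("Hello");
        body.add("</p>");

        ArticleStandIn saved = new ArticleStandIn("Example Title", body);
        ArticleStandIn loaded = (ArticleStandIn) fromBytes(toBytes(saved));

        System.out.println(loaded.titleElement);
        System.out.println(loaded.articleBody);
    }
}

// Hypothetical stand-in for Article (an assumption for this sketch only).
class ArticleStandIn implements Serializable {
    protected static final long serialVersionUID = 1;

    public final String titleElement;
    public final Vector<String> articleBody;

    ArticleStandIn(String titleElement, Vector<String> articleBody) {
        this.titleElement = titleElement;
        this.articleBody = articleBody;
    }
}
```

To store results on the file system instead, the byte-array streams could be swapped for java.io.FileOutputStream and java.io.FileInputStream; the ObjectOutputStream and ObjectInputStream calls stay the same.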
-
wasErrorDownload
public final boolean wasErrorDownload
This informs the user that an error occurred while downloading an article. If this field is TRUE after instantiation, all other fields in this class should be treated as irrelevant.
- Code:
- Exact Field Declaration Expression:
public final boolean wasErrorDownload;
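
One way to honor this flag is to check it before touching any other field. The sketch below assumes a hypothetical Result type that carries only two fields with the same names as in this class; it is not the library's Article class, and the filtering helper is illustrative rather than part of the API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ErrorFlagSketch {
    // Hypothetical stand-in holding two Article-like fields (assumption).
    static final class Result {
        final boolean wasErrorDownload;
        final String titleElement;

        Result(boolean wasErrorDownload, String titleElement) {
            this.wasErrorDownload = wasErrorDownload;
            this.titleElement = titleElement;
        }
    }

    // Collects titles only from results whose download did not fail.
    static List<String> usableTitles(List<Result> results) {
        List<String> titles = new ArrayList<>();
        for (Result r : results) {
            if (r.wasErrorDownload) {
                continue; // other fields are irrelevant for a failed download
            }
            titles.add(r.titleElement);
        }
        return titles;
    }

    public static void main(String[] args) {
        List<Result> results = Arrays.asList(
            new Result(false, "First Story"),
            new Result(true, null),
            new Result(false, "Second Story"));

        System.out.println(usableTitles(results));
    }
}
```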
-
url
public final java.net.URL url
This is the article's URL from the news website.
- Code:
- Exact Field Declaration Expression:
public final URL url;
-
titleElement
public final java.lang.String titleElement
This is the title that was scraped from the main page. The title is the content of the <TITLE>...</TITLE> element on the article HTML page.
- Code:
- Exact Field Declaration Expression:
public final String titleElement;
-
originalPage
public final java.util.Vector<HTMLNode> originalPage
This is the original and complete vectorized-HTML page download. It contains the original, unmodified article download.
- Code:
- Exact Field Declaration Expression:
public final Vector<HTMLNode> originalPage;
-
articleBody
public final java.util.Vector<HTMLNode> articleBody
This is the pared-down article body. It is what is retrieved from class ArticleGet.
- Code:
- Exact Field Declaration Expression:
public final Vector<HTMLNode> articleBody;
-
imageURLs
public final java.util.Vector<java.net.URL> imageURLs
The image URLs that were found in the news article. The easiest way to think about this field is that the following instructions were called on the article body after downloading the article:
Vector<TagNode> imageNodes = TagNodeGet.all(article, TC.OpeningTags, "img");
Vector<URL> imageURLs = Links.resolveSRCs(imageNodes, articleURL);
// The results of the above call are stored in this field / Vector<URL>.
- Code:
- Exact Field Declaration Expression:
public final Vector<URL> imageURLs;
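
To illustrate what resolving image SRC values against the article's address can look like, here is a minimal sketch based on the standard java.net.URI.resolve method. It is not the implementation of Links.resolveSRCs, whose internals are not shown on this page; the helper name resolveAll and the example addresses are assumptions made for illustration, and URI is used here in place of URL.

```java
import java.net.URI;
import java.util.Vector;

class ResolveSrcSketch {
    // Resolves each SRC attribute value against the page address.
    // Relative values become absolute; absolute values are kept as-is.
    static Vector<URI> resolveAll(URI pageAddress, Vector<String> srcValues) {
        Vector<URI> resolved = new Vector<>();
        for (String src : srcValues) {
            resolved.add(pageAddress.resolve(src));
        }
        return resolved;
    }

    public static void main(String[] args) {
        URI page = URI.create("https://news.example.com/world/story.html");

        Vector<String> srcValues = new Vector<>();
        srcValues.add("images/a.jpg");                  // relative to the page's directory
        srcValues.add("/img/b.png");                    // relative to the site root
        srcValues.add("https://cdn.example.com/c.gif"); // already absolute

        for (URI u : resolveAll(page, srcValues)) {
            System.out.println(u);
        }
    }
}
```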
-
imagePosArr
public final int[] imagePosArr
This list contains the "Image Positions" inside the vectorized-article for each image that was found inside the article. The easiest way to think about this field is that the following instructions were called on the article-body after downloading that article:
int[] imagePosArr = TagNodeFind.all(page, TC.OpeningTags, "img");
- Code:
- Exact Field Declaration Expression:
public final int[] imagePosArr;
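
The idea of storing Vector indices rather than the nodes themselves can be sketched with plain strings. In the example below, a Vector<String> stands in for the vectorized page, and findImagePositions is a hypothetical helper, not TagNodeFind.all; the simple "<img" prefix test is only an assumption made to keep the example self-contained.

```java
import java.util.Arrays;
import java.util.Vector;
import java.util.stream.IntStream;

class ImagePositionsSketch {
    // Returns the indices of all entries that look like opening <img> tags.
    static int[] findImagePositions(Vector<String> page) {
        return IntStream.range(0, page.size())
            .filter(i -> page.get(i).startsWith("<img"))
            .toArray();
    }

    public static void main(String[] args) {
        Vector<String> page = new Vector<>();
        page.add("<p>");
        page.add("<img src='a.jpg'>");
        page.add("Some text");
        page.add("</p>");
        page.add("<img src='b.jpg'>");

        int[] positions = findImagePositions(page);
        System.out.println(Arrays.toString(positions)); // prints [1, 4]

        // Each stored index can be used to look the node up again later.
        for (int pos : positions) {
            System.out.println(page.get(pos));
        }
    }
}
```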
-
originalPageStats
public final PageStats originalPageStats
This contains an instance of class PageStats that has been generated from the original newspaper article page.
Java Line of Code:
this.originalPageStats = new PageStats(originalPage);
- Code:
- Exact Field Declaration Expression:
public final PageStats originalPageStats;
-
processedArticleStats
public final PageStats processedArticleStats
This contains an instance of class PageStats that has been generated from the post-processed newspaper article.
Java Line of Code:
this.processedArticleStats = new PageStats(articleBody);
- Code:
- Exact Field Declaration Expression:
public final PageStats processedArticleStats;
-
-
Constructor Detail
-
Article
public Article(java.net.URL url, java.lang.String titleElement, java.util.Vector<HTMLNode> originalPage, java.util.Vector<HTMLNode> articleBody, java.util.Vector<java.net.URL> imageURLs, int[] imagePosArr)
Builds an instance of this class.
- Parameters:
url - The web address from which this news article was downloaded / retrieved.
titleElement - The contents of the HTML <TITLE> tag, as a String.
originalPage - Vectorized-HTML of the original article web page, in its entirety.
articleBody - Vectorized-HTML of the body of the article's page, as extracted by the ArticleGet function-pointer.
imageURLs - A list of the URLs of all HTML <IMG> elements found inside the 'articleBody'.
imagePosArr - The Vector indices where the images (if any) were found in the article.
-
-