Class Article

  • All Implemented Interfaces:
    java.io.Serializable

    public class Article
    extends java.lang.Object
    implements java.io.Serializable
    When a news article is downloaded from a URL, its contents are parsed, and the resulting HTML and related information are stored in this class.

    This class stores the results of downloading / scraping a news-article from a news-site. Instances of this class are produced by calls to the class ScrapeArticles. These results can be saved to a Vector, or stored to the file-system for later use. Internally, an instance can contain both the original news-site article web-page and the pared-down article-body web-page.
    See Also:
    Serialized Form


    • Field Detail

      • serialVersionUID

        protected static final long serialVersionUID
        This provides the serialVersionUID that is recommended for all classes that implement Java's interface java.io.Serializable. Using the Serializable implementation offered by Java is very easy, and can make saving program state when debugging a lot easier. It can also be used in place of more complicated systems like "Hibernate" to store data.
        See Also:
        Constant Field Values
        Code:
        Exact Field Declaration Expression:
        protected static final long serialVersionUID = 1;
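
To illustrate the point about saving state, here is a minimal sketch of a serialization round trip using only java.io. The class SavedArticle and the helpers toBytes and fromBytes are hypothetical stand-ins invented for this example; they are not part of this library, and the real Article class holds Vector<HTMLNode> fields rather than Vector<String>.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Vector;

class SerializationSketch {

    // Hypothetical stand-in for Article; it only mirrors the Serializable pattern.
    static class SavedArticle implements Serializable {
        protected static final long serialVersionUID = 1;

        final String titleElement;
        final Vector<String> articleBody;

        SavedArticle(String titleElement, Vector<String> articleBody) {
            this.titleElement = titleElement;
            this.articleBody = articleBody;
        }
    }

    // Serializes any Serializable object into an in-memory byte array.
    static byte[] toBytes(Serializable obj) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new IllegalStateException("serialization failed", e);
        }
        return buffer.toByteArray();
    }

    // Restores an object previously written by toBytes.
    static Object fromBytes(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("deserialization failed", e);
        }
    }

    public static void main(String[] args) {
        Vector<String> body = new Vector<>();
        body.add("<p>");
        body.add("Hello");
        body.add("</p>");

        SavedArticle original = new SavedArticle("Example Title", body);
        SavedArticle restored = (SavedArticle) fromBytes(toBytes(original));

        System.out.println(restored.titleElement);
        System.out.println(restored.articleBody);
    }
}
```

Writing to a file instead of a byte array would only require swapping the ByteArrayOutputStream for a FileOutputStream.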
        
      • wasErrorDownload

        public final boolean wasErrorDownload
        This informs the user that an error occurred when downloading an article. If this field is TRUE after instantiation, all other fields in this class should be treated as irrelevant.
        Code:
        Exact Field Declaration Expression:
        public final boolean                wasErrorDownload;
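
In practice, the flag would typically be checked before any other field is read. The following is a minimal sketch of that pattern. The class Result and the method usableTitles are hypothetical names invented for this example, standing in for Article and for whatever code consumes it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ErrorFlagSketch {

    // Hypothetical stand-in for Article, reduced to two of its fields.
    static class Result {
        final boolean wasErrorDownload;
        final String titleElement;

        Result(boolean wasErrorDownload, String titleElement) {
            this.wasErrorDownload = wasErrorDownload;
            this.titleElement = titleElement;
        }
    }

    // Collects titles only from results whose download succeeded.
    static List<String> usableTitles(List<Result> results) {
        List<String> titles = new ArrayList<>();
        for (Result r : results) {
            if (r.wasErrorDownload) continue; // remaining fields are not meaningful
            titles.add(r.titleElement);
        }
        return titles;
    }

    public static void main(String[] args) {
        List<Result> results = Arrays.asList(
            new Result(false, "First Story"),
            new Result(true, null),
            new Result(false, "Second Story"));

        System.out.println(usableTitles(results));
    }
}
```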
        
      • url

        public final java.net.URL url
        This is the article's URL from the news website.
        Code:
        Exact Field Declaration Expression:
        public final URL                    url;
        
      • titleElement

        public final java.lang.String titleElement
        This is the title that was scraped from the main page. The title is the content of the <TITLE>...</TITLE> element on the article HTML page.
        Code:
        Exact Field Declaration Expression:
        public final String                 titleElement;
        
      • originalPage

        public final java.util.Vector<HTMLNode> originalPage
        This is the complete vectorized-HTML page, exactly as it was originally downloaded. It contains the unmodified article download.
        Code:
        Exact Field Declaration Expression:
        public final Vector<HTMLNode>       originalPage;
        
      • articleBody

        public final java.util.Vector<HTMLNode> articleBody
        This is the pared-down article-body. It is what is retrieved by class ArticleGet.
        Code:
        Exact Field Declaration Expression:
        public final Vector<HTMLNode>       articleBody;
        
      • imageURLs

        public final java.util.Vector<java.net.URL> imageURLs
        The image URLs that were found in the news-article. The easiest way to think about this field is that the following instructions were called on the article-body after downloading the article:


         Vector<TagNode> imageNodes  = TagNodeGet.all(articleBody, TC.OpeningTags, "img");
         Vector<URL>     imageURLs   = Links.resolveSRCs(imageNodes, articleURL);
         
         // The results of the above call are stored in this field / Vector<URL>.
        
        Code:
        Exact Field Declaration Expression:
        public final Vector<URL>            imageURLs;
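
For readers unfamiliar with what resolving a SRC attribute involves, here is a minimal sketch using only java.net.URI. It is a plain-Java analogy, not the implementation of Links.resolveSRCs. The method resolveAll is a hypothetical name, the example addresses are invented, and the choice to skip unresolvable values is an assumption; the library may handle them differently.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.Arrays;
import java.util.List;
import java.util.Vector;

class ResolveSrcSketch {

    // Resolves each SRC value against the article's address.
    // Assumption: values that cannot be resolved are skipped.
    static Vector<URL> resolveAll(List<String> srcValues, String articleAddress) {
        URI base = URI.create(articleAddress);
        Vector<URL> resolved = new Vector<>();
        for (String src : srcValues) {
            try {
                resolved.add(base.resolve(src).toURL());
            } catch (MalformedURLException | IllegalArgumentException e) {
                // Skip values that do not form a valid absolute URL.
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        List<String> srcValues = Arrays.asList(
            "img/a.png",                      // relative to the article's directory
            "/static/b.png",                  // relative to the site root
            "https://cdn.example.org/c.jpg"); // already absolute

        for (URL u : resolveAll(srcValues, "https://example.com/news/2020/story.html")) {
            System.out.println(u);
        }
    }
}
```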
        
      • imagePosArr

        public final int[] imagePosArr
        This array contains the position (Vector-index) inside the vectorized article-body of each image that was found in the article. The easiest way to think about this field is that the following instructions were called on the article-body after downloading that article:


          int[] imagePosArr = TagNodeFind.all(articleBody, TC.OpeningTags, "img");
        
        Code:
        Exact Field Declaration Expression:
        public final int[]                  imagePosArr;
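
The idea of an index array into a vectorized page can be sketched in plain Java. This is an analogy only: it uses a Vector<String> of raw tokens in place of Vector<HTMLNode>, and the method findImagePositions is a hypothetical stand-in that does a simplistic prefix match rather than reproducing TagNodeFind.all.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Vector;
import java.util.stream.IntStream;

class ImagePositionsSketch {

    // Returns the indices of all elements that look like opening <img> tags.
    // Simplistic prefix match, for illustration only.
    static int[] findImagePositions(Vector<String> page) {
        return IntStream.range(0, page.size())
            .filter(i -> page.get(i).toLowerCase(Locale.ROOT).startsWith("<img"))
            .toArray();
    }

    public static void main(String[] args) {
        Vector<String> page = new Vector<>(Arrays.asList(
            "<p>", "Some text", "<IMG src='a.png'>", "</p>", "<img src=\"b.png\">"));

        int[] positions = findImagePositions(page);

        // Each stored position can be used to index straight back into the vector.
        for (int pos : positions) {
            System.out.println(pos + " -> " + page.get(pos));
        }
    }
}
```

Storing positions alongside the vector lets a caller jump directly to each image node without rescanning the article.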
        
      • originalPageStats

        public final PageStats originalPageStats
        This contains an instance of class PageStats that has been generated from the original newspaper-article page.

        Java Line of Code:
         this.originalPageStats = new PageStats(originalPage);
        
        Code:
        Exact Field Declaration Expression:
        public final PageStats              originalPageStats;
        
      • processedArticleStats

        public final PageStats processedArticleStats
        This contains an instance of class PageStats that has been generated from the post-processed Newspaper Article.

        Java Line of Code:
         this.processedArticleStats = new PageStats(articleBody);
        
        Code:
        Exact Field Declaration Expression:
        public final PageStats              processedArticleStats;
        
    • Constructor Detail

      • Article

        public Article(java.net.URL url,
                       java.lang.String titleElement,
                       java.util.Vector<HTMLNode> originalPage,
                       java.util.Vector<HTMLNode> articleBody,
                       java.util.Vector<java.net.URL> imageURLs,
                       int[] imagePosArr)
        Builds an instance of this class.
        Parameters:
        url - The web-address from which this news-article was downloaded / retrieved.
        titleElement - The contents of the HTML <TITLE> tag, as a String.
        originalPage - Vectorized-HTML of the original article web-page, in its entirety.
        articleBody - Vectorized-HTML of the body of the article's page, as extracted by the ArticleGet function-pointer.
        imageURLs - A list of the URLs of all HTML <IMG> elements found inside the 'articleBody'.
        imagePosArr - The Vector-indices where the images (if any) were found in the article.