public class ForeignNewsArticle extends java.lang.Object
A simple Foreign News Article Scraper.
This class translates the contents of a news article, written in any language that may be translated using the Google Cloud Server Translate API, into English. It does a very simple rendition of translation. It expects the user of this class to "pick out the article content" and provide that vectorized-HTML sub-page to the processArticle(...) method of this class.
This class will:
- Translate the text from the native-language to English.
- Generate a side-by-side article with both original-language and English article content
- Save the page as an "index.html" file in the user-specified directory
- Download any photos present in the HTML
- Re-name the photo file-names, after downloading them to a local user-specified directory.
- Update the page HTML <IMG SRC="..."> nodes accordingly with the new image names.
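The renaming and SRC-update steps in the list above can be illustrated with a minimal sketch. It is not the library's implementation: it works on a plain HTML string with a regular expression rather than on a Vector<HTMLNode>, and the class name ImgSrcRenumberSketch, the numbering scheme ("1.png", "2.jpg", ...) and the ".jpg" fallback extension are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of the Torello library: renumbers <IMG SRC="..."> values.
final class ImgSrcRenumberSketch {

    // Matches the quoted SRC attribute value inside an <IMG ...> tag (case-insensitive).
    private static final Pattern IMG_SRC = Pattern.compile(
        "(<img\\b[^>]*?\\bsrc\\s*=\\s*)([\"'])(.*?)\\2", Pattern.CASE_INSENSITIVE);

    // Returns the HTML with every image SRC replaced by a numbered local file name.
    // The original URLs are collected, in order, into 'originalUrls' for later download.
    static String renumber(String html, List<String> originalUrls) {
        Matcher m = IMG_SRC.matcher(html);
        StringBuffer out = new StringBuffer();
        int count = 0;
        while (m.find()) {
            String oldSrc = m.group(3);
            originalUrls.add(oldSrc);
            count++;
            String newName = count + extensionOf(oldSrc);
            String replacement = m.group(1) + m.group(2) + newName + m.group(2);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Extracts ".png", ".jpg", etc. from a URL path; falls back to ".jpg" (an assumption).
    static String extensionOf(String url) {
        int query = url.indexOf('?');
        String path = (query >= 0) ? url.substring(0, query) : url;
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        return (dot > slash) ? path.substring(dot) : ".jpg";
    }

    public static void main(String[] args) {
        List<String> urls = new ArrayList<>();
        String html = "<p><IMG SRC=\"http://example.com/a/b.png\" ALT=\"photo\"></p>";
        System.out.println(renumber(html, urls));
        System.out.println(urls);
    }
}
```

Collecting the original URLs while rewriting keeps the download list and the new file names in the same order, which is what makes numbering "as they appear in the article" straightforward.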
In order to Translate a Foreign Language News Article into English or Spanish - this is the only class that is really needed. It does a "simple-translation" using the Google Cloud Server Translate API.
IMPERATIVE: This class makes calls to the GCSTAPI, and therefore Google requires an "API Key" so that it can bill your account for the translations. As explained elsewhere, this Java package will not "eat" your API key, but it does expect one in order for these classes to work.
class GCSTAPI has a field simply called public static String key that needs to be set to a valid GCS Translate API Key; otherwise the API calls will fail. You may read more about this on Google's website, and in the class Torello.Languages.GCSTAPI.
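One way to avoid hard-coding the key is to load it from the environment before any translation call is made. The sketch below is only an illustration: the class name TranslateKeySketch and the environment variable name GCS_TRANSLATE_API_KEY are hypothetical, and the assignment to the library's static field is shown as a comment so that the example compiles without the library on the class-path.

```java
import java.util.Map;

// Hypothetical helper, NOT part of the Torello library.
final class TranslateKeySketch {

    // Assumed environment variable name; choose whatever suits your deployment.
    static final String ENV_VAR = "GCS_TRANSLATE_API_KEY";

    // Returns the trimmed key, or fails early with a readable message if it is missing.
    static String requireKey(Map<String, String> env) {
        String key = env.get(ENV_VAR);
        if (key == null || key.trim().isEmpty())
            throw new IllegalStateException("Set " + ENV_VAR + " before translating");
        return key.trim();
    }

    public static void main(String[] args) {
        String key = requireKey(System.getenv());

        // With the library available, the key would then be assigned to the
        // static field described above:
        // GCSTAPI.key = key;

        System.out.println("API key loaded (" + key.length() + " characters)");
    }
}
```

Failing early with a clear message tends to be easier to diagnose than a rejected API call partway through an article.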
FINALLY: This class makes calls to class ImageScraper, which uses a Time-Out monitor-thread to prevent locking up when downloading images. However, when your program exits, it may sit idle for anywhere between 1 second and 1 minute, because the Java JRE does not automatically kill all threads - even when program flow exits and terminates.
To solve this problem immediately, call:
- View Here: Torello/Languages/ForeignNewsArticle.java
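The lingering-thread behavior described above is general JVM behavior: a live non-daemon thread keeps the process running after main returns. The sketch below reproduces it with a plain java.util.concurrent executor. It does not show how ImageScraper is implemented or which method it exposes for shutting down its monitor; the class LingeringThreadSketch and its methods are hypothetical stand-ins.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a time-out monitor; NOT the ImageScraper implementation.
final class LingeringThreadSketch {

    // Starts one non-daemon worker thread with a task scheduled a minute from now.
    static ScheduledExecutorService startMonitor() {
        ScheduledExecutorService monitor = Executors.newSingleThreadScheduledExecutor();
        monitor.schedule(() -> { }, 1, TimeUnit.MINUTES);
        return monitor;
    }

    // Cancels pending tasks and waits briefly for the worker thread to finish.
    static boolean stopMonitor(ScheduledExecutorService monitor) {
        monitor.shutdownNow();
        try {
            return monitor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        ScheduledExecutorService monitor = startMonitor();

        // Without this call, the worker thread would keep the JVM alive after main returns.
        System.out.println("terminated: " + stopMonitor(monitor));
    }
}
```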
Stateless Class: This class neither contains any program-state, nor can it be instantiated. The
@StaticFunctional Annotation may also be called 'The Spaghetti Report'.
Static-Functional classes are, essentially, C-Styled Files, without any constructors or non-static member fields. It is very similar to the Java-Bean
- 1 Constructor(s), 1 declared private, zero-argument constructor
- 1 Method(s), 1 declared static
- 1 Field(s), 1 declared static, 1 declared final
Fields
- static final String HEADER
Methods
- static Ret3<Vector<String>, Vector<String>, String> processArticle(Vector<HTMLNode> articleBody, URL url, String title, LC srcLang, Appendable log, String targetDirectory)
public static final java.lang.String HEADER
This is the HTML page header that is appended to the output page.
- Exact Field Declaration Expression:
public static Ret3<java.util.Vector<java.lang.String>,java.util.Vector<java.lang.String>,java.lang.String> processArticle (java.util.Vector<HTMLNode> articleBody, java.net.URL url, java.lang.String title, LC srcLang, java.lang.Appendable log, java.lang.String targetDirectory) throws java.io.IOException
This will download and translate a news article from a foreign news website. All that you need to do is provide the main "Article-Body" of the article, and some information - and calls to Google Cloud Server Translate API will be handled by the code.
IMPORTANT NOTE: This class makes calls to the GCSTAPI, which is an acronym meaning the Google Cloud Server Translate API. This server expects you to pay Google for the services that it provides. The translations are not free - but they are not too expensive either. You must be sure to set the class GCSTAPI -> String key field in order for the GCS Translate API Queries to succeed.
Your Directory Will Contain:
- Article Photos, stored by number as they appear in the article
- index.html - Article Body with Translations
articleBody - This should have the content of the article from the vectorized HTML page. Read more about cleaning an HTML news article in the class ArticleGet.
// Generally, retrieving the "Article Body" from a news-article web-page is a 'sort-of' simple
// two-step process.
//
// Step 1: You must look at the web-page in your browser and press your browser's "View Content"
//         Button. Identify the HTML Divider Element that looks something to the effect of
//         <DIV CLASS='article_body'> ... or maybe <DIV CLASS='page_content'>
//         You will have to find the relevant divider, or article element once, and only once,
//         per website
//
// Step 2: Grab that content with a simple call to the Inclusive-Get methods in NodeSearch

URL url = new URL("https://some.foreign-news.site/some-article.html");
Vector<HTMLNode> articlePage = HTMLPage.getPageTokens(url, false);
Vector<HTMLNode> articleBody = InnerTagGetInclusive.first
    (articlePage, "div", "class", TextComparitor.C, "page-content");
    // use whatever tag you have found via the "View Content"
    // Button on your browser. You only need to find this tag
    // once per website!

// Now pass the 'articleBody' to this 'processArticle' method.
// You will also have to retrieve the "Article Title" manually as well.
// Hopefully it is obvious that the 'title' could be stored in any number of ways
// depending on which site is being viewed. The title location is usually "consistently
// the same" as long as you're on the same website.

String title = "?";      // you must search the page to retrieve the title
LC articleLC = LC.es;    // Select the (spoken) language used in the article.
                         // This could be LC.vi (Vietnamese), LC.es (Spanish) etc...

Ret3<Vector<String>, Vector<String>, String> response = processArticle
    (articleBody, url, title, articleLC, new StorageWriter(), "outdir/");

// The returned String-Vectors will have the translated sentences and words readily
// available for use - if you wish to further process the article-content.
// The output directory 'outdir/' will have a readable 'index.html' file, along
// with any photos that were found on the page already downloaded so they may be
// locally included on the output page.
url - The URL of the article being scraped. This is used only for including a link to the article's original page in the output index.html file.
title - This is needed because obtaining the title can be done in myriad ways. Keeping it as an "external option" provides more leeway to the coder/programmer.
srcLang - This is just the "two character" language code that Google Cloud Server expects to see.
log - This logs progress to terminal out. Null may be passed, in which case output will not be displayed. Any implementation of java.lang.Appendable will suffice. Make note that the 'Appendable' interface allows / requires heeding IOException's for its 'append(...)' methods.
targetDirectory - This is the directory where the image-files and 'index.html' file will be stored.
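The 'log' parameter described above accepts any java.lang.Appendable, or null. The following sketch shows one way a caller-side helper could treat such a nullable log. The class NullableLogSketch is hypothetical, and wrapping IOException in UncheckedIOException is a choice made for this sketch only; processArticle itself declares throws IOException.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper, NOT part of the Torello library.
final class NullableLogSketch {

    // Writes one line to the log; silently does nothing when 'log' is null.
    static void log(Appendable log, String message) {
        if (log == null) return;
        try {
            log.append(message).append('\n');
        } catch (IOException e) {
            // Appendable.append(...) declares IOException, so it must be handled here.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StringBuilder captured = new StringBuilder();   // StringBuilder implements Appendable
        log(captured, "Translating article...");
        log(System.out, "Downloading images...");       // PrintStream implements Appendable
        log(null, "never shown");
        System.out.print(captured);
    }
}
```

Because StringBuilder, PrintStream and Writer all implement Appendable, the same call site can log to memory, to the terminal or to a file.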
- This will return an instance of:
Ret3<Vector<String>, Vector<String>, String>
This vector contains a list of sentences, or sentence-fragments, in the original language of the news or article.
This vector contains a list of sentences, or sentence-fragments, in the target language, which is English.
This array of strings contains a list of filenames, one for each image that was present on the original news or article page, and therefore downloaded.
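If further processing of the two returned sentence vectors is wanted, they could be paired into side-by-side table rows along the following lines. This is a sketch under a stated assumption: it treats the two vectors as index-aligned (entry i of the original corresponds to entry i of the translation), which the description above does not explicitly guarantee. The class SideBySideSketch is hypothetical and is not how the library builds its own index.html.

```java
import java.util.Arrays;
import java.util.Vector;

// Hypothetical helper, NOT part of the Torello library.
final class SideBySideSketch {

    // Builds one <tr> per sentence pair; stops at the shorter vector if sizes differ.
    static String toTableRows(Vector<String> original, Vector<String> english) {
        int n = Math.min(original.size(), english.size());
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append("<tr><td>").append(escape(original.get(i)))
              .append("</td><td>").append(escape(english.get(i)))
              .append("</td></tr>\n");
        }
        return sb.toString();
    }

    // Minimal HTML escaping so sentence text cannot break the table markup.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        Vector<String> es = new Vector<>(Arrays.asList("Buenos dias.", "Hasta luego."));
        Vector<String> en = new Vector<>(Arrays.asList("Good morning.", "See you later."));
        System.out.print(toTableRows(es, en));
    }
}
```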