Package Torello.Languages
Class ForeignNewsArticle
- java.lang.Object
-
- Torello.Languages.ForeignNewsArticle
-
public class ForeignNewsArticle extends java.lang.Object
A simple Foreign News Article Scraper.
This class translates the contents of a news article, written in any language supported by the Google Cloud Server Translate API, into English. It performs a very simple rendition of translation. The user of this class is expected to "pick out the article content" and provide that vectorized-HTML sub-page to the processArticle(...) method of this class.
This class will:
- Translate the text from the native language to English.
- Generate a side-by-side article with both original-language and English article content.
- Save the page as an "index.html" file in the user-specified directory.
- Download any photos present in the HTML.
- Rename the photo files after downloading them to a local user-specified directory.
- Update the page HTML <IMG SRC="..."> nodes accordingly with the new image names.
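The last two steps in the list above (renaming photos and updating the IMG SRC nodes) can be pictured with a small, standard-library-only sketch. It is not the code this class uses. The class name ImgRenumberSketch, the regular expression, and the "1.jpg, 2.png" naming scheme are assumptions made for illustration, and the sketch works on a plain String rather than on a Vector<HTMLNode>.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Torello.Languages.
final class ImgRenumberSketch {
    // Matches the double-quoted SRC value of an <IMG> tag (case-insensitive).
    private static final Pattern IMG_SRC = Pattern.compile(
        "(<img\\b[^>]*?\\bsrc=\")([^\"]*)(\")", Pattern.CASE_INSENSITIVE);

    /** Rewrites every IMG SRC to a numbered local name; appends each new name to 'names'. */
    static String renumber(String html, List<String> names) {
        Matcher m = IMG_SRC.matcher(html);
        StringBuffer out = new StringBuffer();
        int count = 0;
        while (m.find()) {
            String src = m.group(2);
            int dot = src.lastIndexOf('.');
            int slash = src.lastIndexOf('/');
            // Keep the extension only when the dot belongs to the last path segment.
            String ext = (dot > slash) ? src.substring(dot) : "";
            String local = (++count) + ext;  // assumed naming scheme: 1.jpg, 2.png, ...
            names.add(local);
            m.appendReplacement(out,
                Matcher.quoteReplacement(m.group(1) + local + m.group(3)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>();
        String html = "<p><IMG SRC=\"https://example.org/a/photo.jpg\">"
                    + "<img alt=\"b\" src=\"/img/b.png\"></p>";
        System.out.println(renumber(html, names));
        System.out.println(names);
    }
}
```

A real implementation would also have to download each image before rewriting its reference, and would need to handle single-quoted or unquoted attribute values.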
To translate a foreign-language news article into English or Spanish, this is the only class that is really needed. It does a "simple-translation" using the Google Cloud Server Translate API.
IMPERATIVE: This class makes calls to the GCSTAPI, and therefore Google is going to want an "API Key" so that it can bill your account for the translations. As explained elsewhere, this Java package is not going to "eat" your API key, but it does expect one for these classes to work. The class GCSTAPI has a simple field, public static String key, that needs to be set to a valid GCS Translate API Key, because otherwise the API calls will fail. You may read more about this on Google's website, and in the class Torello.Languages.GCSTAPI.
FINALLY: This class makes calls to class ImageScraper, which uses a Time-Out monitor-thread to prevent locking up when downloading images. However, when your program exits, it may sit idle for anywhere between 1 second and 1 minute, because the Java Runtime does not automatically kill all threads, even after program flow has finished.
To solve this problem immediately, call: ImageScraper.shutdownTOThreads();
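The general Java behavior behind this note can be sketched with the standard library alone. The internals of ImageScraper are not shown on this page, so the example below is only an analogy: it assumes a worker pool built with java.util.concurrent, whose default threads are non-daemon and therefore keep the JVM alive until the pool is shut down. The class name ShutdownSketch is hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not the ImageScraper implementation.
final class ShutdownSketch {
    /** Runs one task on a non-daemon worker, then shuts the pool down explicitly. */
    static boolean runAndShutdown() {
        // Threads created by the default factory are non-daemon threads.
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            pool.submit(() -> System.out.println("pretending to download an image"));
        } finally {
            // Without this call the idle worker thread would keep the JVM running.
            pool.shutdown();
        }
        try {
            return pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("pool terminated: " + runAndShutdown());
    }
}
```

Calling ImageScraper.shutdownTOThreads() at the end of a program plays the role that pool.shutdown() plays in this sketch.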
Hi-Lited Source-Code:
- View Here: Torello/Languages/ForeignNewsArticle.java
File Size: 10,943 Bytes Line Count: 238 '\n' Characters Found
Stateless Class: This class neither contains any program-state, nor can it be instantiated. The @StaticFunctional Annotation may also be called 'The Spaghetti Report'. Static-Functional classes are, essentially, C-Styled Files, without any constructors or non-static member fields. It is a concept very similar to the Java-Bean's @Stateless Annotation.
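The shape of such a class can be sketched in plain Java. The example below is a hypothetical stand-in, not the source of ForeignNewsArticle: the class name, the field value, and the method are invented for illustration, and the @StaticFunctional annotation itself is omitted so that the sketch compiles without the library.

```java
// Hypothetical illustration only; names and values are invented.
final class StaticFunctionalSketch {
    // The only field is a static final constant, so the class holds no mutable state.
    static final String HEADER = "<html><body>";

    // A private, zero-argument constructor prevents instantiation from outside.
    private StaticFunctionalSketch() { }

    /** A static method: all inputs arrive as parameters, nothing is remembered. */
    static String wrapTitle(String title) {
        return HEADER + "<h1>" + title + "</h1>";
    }

    public static void main(String[] args) {
        System.out.println(wrapTitle("Noticias"));
    }
}
```

Callers use such a class through its static members only, for example StaticFunctionalSketch.wrapTitle("Noticias").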
- 1 Constructor(s), 1 declared private, zero-argument constructor
- 1 Method(s), 1 declared static
- 1 Field(s), 1 declared static, 1 declared final
-
-
Field Summary
Fields
Modifier and Type    Field
static String        HEADER
-
Method Summary
Modifier and Type    Method
static Ret3<Vector<String>, Vector<String>, String[]>    processArticle(Vector<HTMLNode> articleBody, URL url, String title, LC srcLang, Appendable log, String targetDirectory)
-
-
-
Field Detail
-
HEADER
public static final java.lang.String HEADER
This is the HTML page header that is appended to the output page.
-
-
Method Detail
-
processArticle
public static Ret3<java.util.Vector<java.lang.String>,java.util.Vector<java.lang.String>,java.lang.String[]> processArticle (java.util.Vector<HTMLNode> articleBody, java.net.URL url, java.lang.String title, LC srcLang, java.lang.Appendable log, java.lang.String targetDirectory) throws java.io.IOException, ImageScraperException
This will download and translate a news article from a foreign news website. All that you need to do is provide the main "Article-Body" of the article, and some information - and calls to Google Cloud Server Translate API will be handled by the code.
IMPORTANT NOTE: This class makes calls to the GCSTAPI, which is an acronym for the Google Cloud Server Translate API. This server expects you to pay Google for the services that it provides. The translations are not free, but they are not too expensive either. You must be sure to set the class GCSTAPI -> String key field in order for the GCS Translate API queries to succeed.
Your Directory Will Contain:
- Article Photos, stored by number as they appear in the article
- index.html - Article Body with Translations
- Parameters:
articleBody
- This should have the content of the article from the vectorized HTML page. Read more about cleaning an HTML news article in the class ArticleGet.
Example:
    // Generally retrieving the "Article Body" from a news-article web-page is a 'sort-of' simple
    // two-step process.
    //
    // Step 1: You must look at the web-page in your browser and press your browser's "View Content"
    //         Button.  Identify the HTML Divider Element that looks something to the effect of
    //         <DIV CLASS='article_body'> ... or maybe <DIV CLASS='page_content'>
    //         You will have to find the relevant divider, or article element once, and only once,
    //         per website
    //
    // Step 2: Grab that content with a simple call to the Inclusive-Get methods in NodeSearch

    URL url = new URL("https://some.foreign-news.site/some-article.html");
    Vector<HTMLNode> articlePage = HTMLPage.getPageTokens(url, false);
    Vector<HTMLNode> articleBody = InnerTagGetInclusive.first
        (articlePage, "div", "class", TextComparitor.C, "page-content");
        // use whatever tag you have found via the "View Content"
        // Button on your browser.  You only need to find this tag
        // once per website!

    // Now pass the 'articleBody' to this 'processArticle' method.
    // You will also have to retrieve the "Article Title" manually as well.
    // Hopefully it is obvious that the 'title' could be stored in any number of ways
    // depending on which site is being viewed.  The title location is usually "consistently
    // the same" as long as you're on the same website.

    String title = "?";     // you must search the page to retrieve the title
    LC articleLC = LC.es;   // Select the (spoken) language used in the article.
                            // This could be LC.vi (Vietnamese), LC.es (Spanish) etc...

    // The method is static, so it is called through the class name.
    Ret3<Vector<String>, Vector<String>, String[]> response = ForeignNewsArticle.processArticle
        (articleBody, url, title, articleLC, new StorageWriter(), "outdir/");

    // The returned String-Vectors will have the translated sentences and words readily
    // available for use - if you wish to further process the article-content.
    // The output directory 'outdir/' will have a readable 'index.html' file, along
    // with any photos that were found on the page already downloaded so they may be
    // locally included on the output page.
url
- The URL of the article being scraped. This is used only for including a link to the article's original page on the output index.html file.
title
- This is needed because obtaining the title can be done in myriad ways. Keeping it as an "external option" provides more leeway to the coder/programmer.
srcLang
- This is just the "two character" language code that Google Cloud Server expects to see.
log
- This logs progress to terminal out. Null may be passed, in which case output will not be displayed. Any implementation of java.lang.Appendable will suffice. Make note that the 'Appendable' interface allows / requires heeding IOExceptions for its 'append(...)' methods.
targetDirectory
- This is the directory where the image-files and 'index.html' file will be stored.
- Returns:
- This will return an instance of:
Ret3<Vector<String>, Vector<String>, String[]>
-
ret3.a (Vector<String>)
This vector contains a list of sentences, or sentence-fragments, in the original language of the news or article.
-
ret3.b (Vector<String>)
This vector contains a list of sentences, or sentence-fragments, in the target language, which is English.
-
ret3.c (String[])
This array of strings contains a list of filenames, one for each image that was present on the original news or article page, and therefore downloaded.
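One way to consume the two returned vectors is to walk them together. The sketch below uses only java.util and plain Vector<String> arguments in place of ret3.a and ret3.b. It assumes, which this page does not state, that the two vectors are index-aligned (entry i of one is the translation of entry i of the other). The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Vector;

// Hypothetical illustration only; assumes the two vectors are index-aligned.
final class SentencePairSketch {
    /** Builds one "original | English" row per index present in both vectors. */
    static List<String> pair(Vector<String> original, Vector<String> english) {
        // Stop at the shorter vector so a length mismatch cannot throw.
        int n = Math.min(original.size(), english.size());
        List<String> rows = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            rows.add(original.get(i) + " | " + english.get(i));
        }
        return rows;
    }

    public static void main(String[] args) {
        Vector<String> es = new Vector<>(List.of("Hola.", "Buenos dias."));
        Vector<String> en = new Vector<>(List.of("Hello.", "Good morning."));
        pair(es, en).forEach(System.out::println);
    }
}
```

With a real result, the arguments would be the 'a' and 'b' members of the returned Ret3 instance.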
-
- Throws:
java.io.IOException
ImageScraperException
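Because the 'log' parameter is a java.lang.Appendable, its append(...) methods declare java.io.IOException, and a null log must be tolerated. The sketch below shows one way a caller or helper might deal with both points using only the standard library. The class AppendableLogSketch and its methods are hypothetical and are not part of this package.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration only; not part of Torello.Languages.
final class AppendableLogSketch {
    /** Writes one line to 'log'; a null log silently discards the message. */
    static void logLine(Appendable log, String message) throws IOException {
        if (log == null) return;
        log.append(message).append('\n');
    }

    /** Collects progress messages in a StringBuilder, which implements Appendable. */
    static String demo() {
        StringBuilder sb = new StringBuilder();
        try {
            logLine(sb, "Translating...");
            logLine(null, "discarded");   // null log: nothing is written
            logLine(sb, "Done.");
        } catch (IOException e) {
            // StringBuilder never throws here, but the interface requires handling it.
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

System.out is also an Appendable, so it can be passed as the log when progress should go straight to the terminal.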
-
-