Package Torello.Languages
Class PinYinParse
- java.lang.Object
- Torello.Languages.PinYinParse
public class PinYinParse extends java.lang.Object
PinYinParse (罗马拼音).
This class was originally written in the summer of 2016, although at that time it was in JavaScript. It parses the output that is generated by Google's Translate website. It takes a simplified-Mandarin sentence and its Romanized Pin-Yin as input, and produces parallel vectors of character-words and their pronunciations.
File Size: 6,286 Bytes Line Count: 147 '\n' Characters Found
Stateless Class: This class neither contains any program-state, nor can it be instantiated. The @StaticFunctional Annotation may also be called 'The Spaghetti Report'. Static-Functional classes are, essentially, C-Styled Files, without any constructors or non-static member fields. It is a concept very similar to the Java-Bean's @Stateless
- 1 Constructor(s), 1 declared private, zero-argument constructor
- 1 Method(s), 1 declared static
- 0 Field(s)
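The shape described above (one private zero-argument constructor, only static methods, no fields) can be sketched as follows. This is a minimal illustration of the pattern, not the PinYinParse source; the class name WordJoiner and its join method are hypothetical.

```java
// Hypothetical static-only utility class, shown purely to illustrate the pattern.
final class WordJoiner {

    // Private zero-argument constructor: the class cannot be instantiated
    // from outside, so it never carries per-instance state.
    private WordJoiner() { }

    // All behavior is exposed through static methods; there are no fields.
    static String join(String left, String right) {
        return left + " " + right;
    }
}
```

Callers use the class name directly, for example WordJoiner.join("ni", "hao"), in the same way that PinYinParse.parse is called without creating an object.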
Method Summary
Modifier and Type: static boolean
Method: parse(Appendable DOUT, String simpSentence, String pronSentence, Vector<String> characters, Vector<String> pronunciation)
Description: The purpose of this method is to produce the parallel arrays (Vector) which contain Chinese Characters and Chinese PinYin based on the results of the Google Translate Query.
Method Detail
public static boolean parse (java.lang.Appendable DOUT, java.lang.String simpSentence, java.lang.String pronSentence, java.util.Vector<java.lang.String> characters, java.util.Vector<java.lang.String> pronunciation) throws java.io.IOException
The purpose of this method is to produce the parallel arrays (Vector) which contain Chinese Characters and Chinese PinYin based on the results of the Google Translate Query.
Scrape, non-API Invocation:
This is of "limited use", since the primary input to this function is a String that has been scraped from the Google Translate Website, not a String from a query to Google Cloud Server's Translate-API.
The API version of Mandarin Translations leaves out the Pin-Yin Romanizations, which makes the entire package a lot less usable. The web-site itself can be scraped, and the Pin-Yin obtained, but that String comes from a web-site that changes from time-to-time.
Using a Bot:
If scraping Google's Translate Web-site conjures images of the police coming to your door, another web-site also seems to do pretty good Romanization. I have another class that scrapes that site.
- Parameters:
- DOUT - This is filled up with Debug Information as this method runs. It may be any implementation of Java's java.lang.Appendable
- simpSentence - This is the complete simplified-Mandarin sentence obtained from a news-article.
- pronSentence - This is the pronunciation of the simplified-Mandarin sentence. This should have already been obtained from Google Translate.
- characters - This should be an empty vector. It will be populated by the words from the original Mandarin sentence, based on the pronunciation obtained from Google Translate.
- pronunciation - This should also be an empty vector. It will be populated after the words from the pronunciation sentence have been parsed into individual words.
- Returns:
- boolean This is true if there was possibly an error along the way.
The specific requirements for the boolean value are:
(cSent.length() != totalChinese) && (totalChinese > 0);
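To make the return condition above easier to reason about, here is a minimal sketch that isolates it in a helper. The class MismatchFlag and the method possibleError are hypothetical names used only for illustration; cleanedLength stands in for cSent.length().

```java
// Hypothetical helper that isolates the return expression shown above.
final class MismatchFlag {

    private MismatchFlag() { }

    // cleanedLength plays the role of cSent.length(); totalChinese is the
    // number of characters that were matched against pronunciation words.
    static boolean possibleError(int cleanedLength, int totalChinese) {
        return (cleanedLength != totalChinese) && (totalChinese > 0);
    }
}
```

Under this expression, a run that matched no characters at all (totalChinese equal to 0) evaluates to false even when the lengths differ, so a false result on its own does not guarantee that the vectors were populated.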
- Throws:
- java.io.IOException - The interface java.lang.Appendable mandates that IOException must be treated as a checked exception for all output operations. Therefore IOException is a required exception in this method's throws clause.
- Code:
- Exact Method Body:
    int totalChinese = 0;

    DOUT.append("********************************************\n");
    DOUT.append("chin = " + simpSentence + "\n");
    DOUT.append("pron = " + pronSentence + "\n");

    // remove "alternate" (AUC) versions of A...Z or 0..9 are there..
    String cSent = ZH.convertAnyAUC(simpSentence);

    // CHANGED 2018.09.24 - dellAllPunctuation does not remove '.' and ',' between numbers!
    String pSent = ZH.delAllPunctuationPINYIN(pronSentence);
    cSent = ZH.delAllPunctuationCHINESE(cSent);

    DOUT.append("********************************************\n");
    DOUT.append("After Removing non-alphanumeric UniCode, and Alt-UniCode:\n");
    DOUT.append("cSent=" + cSent + "\n");
    DOUT.append("pSent=" + pSent + "\n");
    DOUT.append("********************************************\n");

    // Leading or ending blanks messes this up
    // *** Use trim()
    String[] pWords = pSent.trim().split(" ");

    for (int i = 0; i < pWords.length; i++)
    {
        String pronWord = pWords[i].trim();

        if (pronWord.length() == 0) continue;

        // Sometimes alphabetic characters appear in the chinese string.
        int leading = ZH.countLeadingLettersAndNumbers(cSent.substring(totalChinese));

        if (leading > 0)
        {
            String alphaNumericASCII = cSent.substring(totalChinese, totalChinese + leading);

            DOUT.append("*** Found English and Numbers ASCII in Chinese Sentence ***\n");
            DOUT.append("There are " + leading + " leading alpha numeric characters.");
            DOUT.append(" [" + alphaNumericASCII + "]\n");
            DOUT.append("pronunciation word is: [" + pronWord + "]\n");

            pronunciation.add(pronWord);
            characters.add(alphaNumericASCII);
            totalChinese += leading;
        }

        // else - it's just normal characters in the chinese string
        else
        {
            int numChinese = ZH.countSyllablesAndNonChinese(pronWord, DOUT);
            String chineseWord = cSent.substring(totalChinese, totalChinese + numChinese);

            DOUT.append("The word [" + pronWord + "] ");
            DOUT.append("corresponds to " + numChinese + " Unicode Characters ");
            DOUT.append("[" + chineseWord + "]\n");

            // Add the new word to the list
            pronunciation.add(pronWord);
            characters.add(chineseWord);
            totalChinese += numChinese;
        }
    }

    DOUT.append(
        "********************************************\n" +
        "COMPLETED SENTENCE LOOP\n" +
        "SUMMARY:\n" +
        "FOUND (" + totalChinese + ") characters in Chinese String\n" +
        "STRING CONTAINS (" + cSent.length() + ") characters\n" +
        ((totalChinese != cSent.length()) ? "\nPOSSIBLE ERROR MISMATCH\n\n" : "") +
        "********************************************\n"
    );

    return (cSent.length() != totalChinese) && (totalChinese > 0);
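
The core idea of the loop above (consume as many Chinese characters as the current pronunciation word has syllables) can be tried in isolation with the sketch below. It is a simplified stand-in, not the library code: the class AlignSketch is hypothetical, countSyllables is a crude vowel-run heuristic that merely substitutes for ZH.countSyllablesAndNonChinese (whose real behavior is not shown here), the punctuation-removal and leading-ASCII steps are omitted, and the substring length is clamped so that a miscount cannot run past the end of the sentence.

```java
import java.io.IOException;
import java.util.Vector;

// Illustrative stand-in only; this is NOT the Torello.Languages implementation.
final class AlignSketch {

    private AlignSketch() { }

    // ASSUMPTION: one maximal run of vowels equals one syllable. This is a
    // rough heuristic (it miscounts, for example, r-suffixed words) and only
    // substitutes for ZH.countSyllablesAndNonChinese.
    static int countSyllables(String pinyinWord) {
        String vowels = "aeiouüāáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ";
        int count = 0;
        boolean inRun = false;
        for (char c : pinyinWord.toLowerCase().toCharArray()) {
            boolean isVowel = vowels.indexOf(c) >= 0;
            if (isVowel && !inRun) count++;   // a new vowel run starts here
            inRun = isVowel;
        }
        return count;
    }

    // Fills the two parallel vectors; returns true on a possible mismatch.
    // Appendable.append declares IOException, so it must be propagated.
    static boolean align(Appendable log, String chinese, String pinyin,
                         Vector<String> characters, Vector<String> pronunciation)
            throws IOException {
        int consumed = 0;
        for (String word : pinyin.trim().split(" ")) {
            if (word.isEmpty()) continue;

            // Clamp so a miscount cannot index past the end (sketch-only safeguard).
            int n = Math.min(countSyllables(word), chinese.length() - consumed);
            String chunk = chinese.substring(consumed, consumed + n);

            log.append(word).append(" -> ").append(chunk).append('\n');
            pronunciation.add(word);
            characters.add(chunk);
            consumed += n;
        }
        return (chinese.length() != consumed) && (consumed > 0);
    }
}
```

A StringBuilder can be passed as the Appendable, together with two empty Vector<String> instances. Because the method is reached through the Appendable interface, the caller still has to catch or declare IOException even though StringBuilder itself never throws it.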