Package Torello.Languages
Class PinYinParse
- java.lang.Object
  - Torello.Languages.PinYinParse
public class PinYinParse extends java.lang.Object
PinYinParse (罗马拼音).
This class was originally written in the summer of 2016, in JavaScript. It parses the output generated by Google's Translate website: it takes a Mandarin sentence and its Romanized Pin-Yin as input, and produces parallel vectors of character-words and their pronunciations.
Hi-Lited Source-Code:
- View Here: Torello/Languages/PinYinParse.java
- Open New Browser-Tab: Torello/Languages/PinYinParse.java
File Size: 6,286 Bytes
Line Count: 147 '\n' Characters Found
Stateless Class: This class neither contains any program-state, nor can it be instantiated. The @StaticFunctional Annotation may also be called 'The Spaghetti Report'. Static-Functional classes are, essentially, C-Styled Files, without any constructors or non-static member fields. It is a concept very similar to the Java-Bean's @Stateless Annotation.
- 1 Constructor(s), 1 declared private, zero-argument constructor
- 1 Method(s), 1 declared static
- 0 Field(s)
Method Summary
Modifier and Type: static boolean
Method: parse(Appendable DOUT, String simpSentence, String pronSentence, Vector<String> characters, Vector<String> pronunciation)
Description: Produces the parallel arrays (Vectors) that hold the Chinese characters and the Chinese Pin-Yin, based on the results of a Google Translate query.
Method Detail
parse
public static boolean parse (java.lang.Appendable DOUT, java.lang.String simpSentence, java.lang.String pronSentence, java.util.Vector<java.lang.String> characters, java.util.Vector<java.lang.String> pronunciation) throws java.io.IOException
Produces the parallel arrays (Vectors) that hold the Chinese characters and the Chinese Pin-Yin, based on the results of a Google Translate query.
Scrape, non-API Invocation:
This is of "limited use", since the input to this function is primarily a String that has been scraped from the Google Translate Website, not a String returned by a query to Google Cloud Server's Translate-API.
The API version of Mandarin translation leaves out the Pin-Yin Romanizations entirely, which makes the whole package a lot less usable. The web-site itself can be scraped and the Pin-Yin obtained, but that String comes from a web-site that changes from time to time.
Using a Bot:
If scraping Google's Translate web-site conjures images of the police coming to your door, another web-site that seems to do pretty good Romanization is Pin1Yin1.com. I have another class that scrapes that site.
- Parameters:
DOUT - This is filled with debug information as this method runs. It may be any implementation of Java's java.lang.Appendable interface.
simpSentence - The complete simplified-Mandarin sentence, obtained from a news article.
pronSentence - The pronunciation of the simplified-Mandarin sentence. This should already have been obtained from Google Translate.
characters - This should be an empty vector. It will be populated with the words from the original Mandarin sentence, split according to the pronunciation obtained from Google Translate.
pronunciation - This should also be an empty vector. It will be populated with the individual words parsed out of the pronunciation sentence.
- Returns:
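- boolean: true if an error possibly occurred along the way, that is, if the number of characters matched against the pronunciation differs from the length of the cleaned Chinese sentence.
The exact expression returned is shown here, where cSent is the Chinese sentence after punctuation removal and totalChinese is the number of its characters that were matched to pronunciation words:
(cSent.length() != totalChinese) && (totalChinese > 0);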
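The return condition can be read as a small truth table. The following is a minimal sketch and not part of the library: the class name ReturnFlagSketch and the method possibleError are hypothetical, and the two int arguments merely stand in for cSent.length() and totalChinese.

```java
// Hypothetical illustration only; not part of Torello.Languages.
class ReturnFlagSketch
{
    // cleanedLength stands in for cSent.length(), matched stands in for totalChinese.
    static boolean possibleError(int cleanedLength, int matched)
    { return (cleanedLength != matched) && (matched > 0); }

    public static void main(String[] args)
    {
        System.out.println(possibleError(4, 4)); // every character was matched
        System.out.println(possibleError(5, 4)); // one character was left over
        System.out.println(possibleError(5, 0)); // nothing was matched at all
    }
}
```

The three calls evaluate to false, true and false. The last case is worth noting: when no characters are matched at all, the flag is not raised even though the lengths differ.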
- Throws:
java.io.IOException - The interface java.lang.Appendable mandates that IOException be treated as a checked exception for all output operations. Therefore IOException is a required exception in this method's throws clause.
- Code:
- Exact Method Body:
int totalChinese = 0;

DOUT.append("********************************************\n");
DOUT.append("chin = " + simpSentence + "\n");
DOUT.append("pron = " + pronSentence + "\n");

// remove "alternate" (AUC) versions of A...Z or 0..9 are there..
String cSent = ZH.convertAnyAUC(simpSentence);

// CHANGED 2018.09.24 - dellAllPunctuation does not remove '.' and ',' between numbers!
String pSent = ZH.delAllPunctuationPINYIN(pronSentence);
cSent = ZH.delAllPunctuationCHINESE(cSent);

DOUT.append("********************************************\n");
DOUT.append("After Removing non-alphanumeric UniCode, and Alt-UniCode:\n");
DOUT.append("cSent=" + cSent + "\n");
DOUT.append("pSent=" + pSent + "\n");
DOUT.append("********************************************\n");

// Leading or ending blanks messes this up
// *** Use trim()
String[] pWords = pSent.trim().split(" ");

for (int i = 0; i < pWords.length; i++)
{
    String pronWord = pWords[i].trim();

    if (pronWord.length() == 0) continue;

    // Sometimes alphabetic characters appear in the chinese string.
    int leading = ZH.countLeadingLettersAndNumbers(cSent.substring(totalChinese));

    if (leading > 0)
    {
        String alphaNumericASCII = cSent.substring(totalChinese, totalChinese + leading);

        DOUT.append("*** Found English and Numbers ASCII in Chinese Sentence ***\n");
        DOUT.append("There are " + leading + " leading alpha numeric characters.");
        DOUT.append(" [" + alphaNumericASCII + "]\n");
        DOUT.append("pronunciation word is: [" + pronWord + "]\n");

        pronunciation.add(pronWord);
        characters.add(alphaNumericASCII);
        totalChinese += leading;
    }

    // else - it's just normal characters in the chinese string
    else
    {
        int numChinese = ZH.countSyllablesAndNonChinese(pronWord, DOUT);
        String chineseWord = cSent.substring(totalChinese, totalChinese + numChinese);

        DOUT.append("The word [" + pronWord + "] ");
        DOUT.append("corresponds to " + numChinese + " Unicode Characters ");
        DOUT.append("[" + chineseWord + "]\n");

        // Add the new word to the list
        pronunciation.add(pronWord);
        characters.add(chineseWord);
        totalChinese += numChinese;
    }
}

DOUT.append(
    "********************************************\n" +
    "COMPLETED SENTENCE LOOP\n" +
    "SUMMARY:\n" +
    "FOUND (" + totalChinese + ") characters in Chinese String\n" +
    "STRING CONTAINS (" + cSent.length() + ") characters\n" +
    ((totalChinese != cSent.length()) ? "\nPOSSIBLE ERROR MISMATCH\n\n" : "") +
    "********************************************\n"
);

return (cSent.length() != totalChinese) && (totalChinese > 0);
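The core of the method body is a word-by-word alignment: each Pin-Yin word consumes as many Chinese characters as it has syllables. The sketch below is a simplified, self-contained illustration of that idea and is not the library's code. The names AlignSketch, align, syllableCounter and table are hypothetical; syllableCounter is an assumed stand-in for ZH.countSyllablesAndNonChinese, punctuation removal and the ASCII letters-and-numbers branch are left out, and tone marks are dropped from the demo Pin-Yin so that the source stays ASCII-only. It also shows why the method declares IOException: any Appendable, here a StringBuilder, may be passed for debug output.

```java
import java.io.IOException;
import java.util.Map;
import java.util.Vector;
import java.util.function.ToIntFunction;

// Hypothetical, simplified sketch of the alignment loop; not the library's code.
class AlignSketch
{
    // syllableCounter is an assumed stand-in for ZH.countSyllablesAndNonChinese.
    static boolean align(
            Appendable debug, String chinese, String pinyin,
            ToIntFunction<String> syllableCounter,
            Vector<String> characters, Vector<String> pronunciation
        )
        throws IOException
    {
        int consumed = 0;

        for (String word : pinyin.trim().split(" "))
        {
            if (word.isEmpty()) continue;

            // One Chinese character per Pin-Yin syllable. The end index is clamped
            // so that this sketch never reads past the end of the sentence.
            int n   = syllableCounter.applyAsInt(word);
            int end = Math.min(consumed + n, chinese.length());

            String chineseWord = chinese.substring(consumed, end);
            debug.append("[" + word + "] -> [" + chineseWord + "]\n");

            pronunciation.add(word);
            characters.add(chineseWord);
            consumed = end;
        }

        // Same shape as the flag returned by parse(...)
        return (chinese.length() != consumed) && (consumed > 0);
    }

    public static void main(String[] args) throws IOException
    {
        // Demo counter backed by a fixed table; a real counter would analyse the syllables.
        Map<String, Integer> table = Map.of("nihao", 2, "shijie", 2);
        ToIntFunction<String> counter = w -> table.getOrDefault(w, 1);

        Vector<String> characters    = new Vector<>();
        Vector<String> pronunciation = new Vector<>();
        StringBuilder  debug         = new StringBuilder(); // StringBuilder implements Appendable

        // The escaped literal is the four-character sentence "ni hao shi jie"
        boolean flag = align(
            debug, "\u4F60\u597D\u4E16\u754C", "nihao shijie",
            counter, characters, pronunciation
        );

        System.out.print(debug);
        System.out.println("possible error: " + flag);
    }
}
```

For this input the two vectors end up holding 你好, 世界 and nihao, shijie respectively, and the returned flag is false. Appending one extra character to the Chinese sentence, without changing the Pin-Yin, leaves a character unmatched and turns the flag to true.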