Class PinYinParse


  • public class PinYinParse
    extends java.lang.Object
    PinYinParse (罗马拼音, "Romanized Pin-Yin").

    This class was originally written in the summer of 2016, in JavaScript. It parses the output generated by Google's Translate website. It takes Romanized Pin-Yin as input, and produces vectors that pair each Chinese word with its pronunciation.



    Stateless Class:
    This class contains no program state and cannot be instantiated. The @StaticFunctional annotation may also be called 'The Spaghetti Report'. Static-functional classes are, essentially, C-style files, without any accessible constructors or non-static member fields. The concept is very similar to the Java-Bean @Stateless annotation.

    • 1 Constructor(s), 1 declared private, zero-argument constructor
    • 1 Method(s), 1 declared static
    • 0 Field(s)
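The shape described above (one private zero-argument constructor, only static methods, no fields) can be sketched as follows. This is a minimal illustration, not code from the library: the class name SyllableUtil and the method countSpaces are hypothetical.

```java
// Hypothetical example of a "static-functional" class; not part of the library.
final class SyllableUtil
{
    // Private zero-argument constructor: the class cannot be instantiated from outside.
    private SyllableUtil() { }

    // A static method with no fields behind it: the result depends only on the argument.
    static int countSpaces(String s)
    {
        int count = 0;
        for (int i = 0; i < s.length(); i++)
            if (s.charAt(i) == ' ') count++;
        return count;
    }
}
```

Callers use such a class like a C function library, for example SyllableUtil.countSpaces("ni hao ma"), without ever creating an object.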


    • Method Summary

      Modifier and Type Method Description
      static boolean parse​(Appendable DOUT, String simpSentence, String pronSentence, Vector<String> characters, Vector<String> pronunciation)
      Produces the parallel arrays (Vectors) that contain Chinese characters and Chinese PinYin, based on the results of a Google Translate query.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Method Detail

      • parse

        public static boolean parse​
                    (java.lang.Appendable DOUT,
                     java.lang.String simpSentence,
                     java.lang.String pronSentence,
                     java.util.Vector<java.lang.String> characters,
                     java.util.Vector<java.lang.String> pronunciation)
                throws java.io.IOException
        
        Produces the parallel arrays (Vectors) that contain Chinese characters and Chinese PinYin, based on the results of a Google Translate query.

        Scrape, non-API Invocation:
        This method is of limited use, since its primary input is a String that has been scraped from the Google Translate website, not a String returned by a query to Google Cloud Server's Translate API.

        The API version of Mandarin translation leaves out the Pin-Yin Romanization, which makes the entire package a lot less usable. The website itself can be scraped and the Pin-Yin obtained, but that String comes from a website that changes from time to time.

        Using a Bot:
        If scraping Google's Translate website conjures images of the police coming to your door, another website that seems to do pretty good Romanization is Pin1Yin1.com. I have another class that scrapes that site.
        Parameters:
        DOUT - This receives debug information as this method runs. It may be any implementation of the java.lang.Appendable interface.
        simpSentence - This is the complete simplified-Mandarin sentence obtained from a news article.
        pronSentence - This is the pronunciation of the simplified-Mandarin sentence. This should have already been obtained from Google Translate.
        characters - This should be an empty vector. It will be populated by the words from the original Mandarin sentence, based on the pronunciation obtained from Google Translate.
        pronunciation - This should also be an empty vector. It will be populated after the words from the pronunciation sentence have been parsed into individual words.
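To make the "parallel vectors" contract concrete, the sketch below shows how a caller might prepare the two empty vectors and then read them index by index. It does not call the library: the values added are hand-written stand-ins for what parse might produce for the sentence 我喜欢北京, written as Unicode escapes with tone marks omitted so the source stays ASCII. The class ParallelVectorsSketch and its methods are hypothetical names.

```java
import java.util.Vector;

// Hypothetical illustration of reading the two output vectors in lockstep.
final class ParallelVectorsSketch
{
    private ParallelVectorsSketch() { }

    // Joins the vectors index by index: element i of one belongs to element i of the other.
    static String pair(Vector<String> characters, Vector<String> pronunciation)
    {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < characters.size(); i++)
        {
            if (i > 0) sb.append(" | ");
            sb.append(characters.get(i)).append('=').append(pronunciation.get(i));
        }
        return sb.toString();
    }

    static String demo()
    {
        // Both vectors start empty, as the parameter descriptions require.
        Vector<String> characters    = new Vector<>();
        Vector<String> pronunciation = new Vector<>();

        // A real caller would now invoke the documented method, roughly:
        //   boolean possibleError =
        //       PinYinParse.parse(dout, simpSentence, pronSentence, characters, pronunciation);
        // The lines below are hand-written stand-in values, not real output.
        characters.add("\u6211");        pronunciation.add("wo");
        characters.add("\u559c\u6b22");  pronunciation.add("xihuan");
        characters.add("\u5317\u4eac");  pronunciation.add("beijing");

        return pair(characters, pronunciation);
    }
}
```

Because the two vectors are only linked by position, callers should treat them as a unit and avoid sorting or filtering one without the other.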
        Returns:
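        boolean - This is true if an error may have occurred along the way. The exact expression returned is:
        (cSent.length() != totalChinese) && (totalChinese > 0);
        Here cSent is the Chinese sentence after punctuation removal, and totalChinese is the number of its characters that were matched to pronunciation words.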
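The return expression can be read as a small predicate over two integers. The sketch below restates it as a hypothetical helper (MismatchCheck and possibleError are illustrative names, not part of the library) so that its three cases are easy to see.

```java
// Hypothetical restatement of the documented return expression.
final class MismatchCheck
{
    private MismatchCheck() { }

    // sentenceLength stands for cSent.length(); totalChinese is the number of characters consumed.
    static boolean possibleError(int sentenceLength, int totalChinese)
    {
        return (sentenceLength != totalChinese) && (totalChinese > 0);
    }
}
```

With a five-character sentence, possibleError(5, 5) is false (everything matched), possibleError(5, 3) is true (two characters left over), and possibleError(5, 0) is false. The last case follows directly from the expression: when nothing at all was matched, no error is reported, so callers may also want to check whether the output vectors are empty.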
        Throws:
        java.io.IOException - The java.lang.Appendable interface declares IOException as a checked exception on all of its output operations. Therefore IOException is a required exception in this method's throws clause.
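In practice this means a caller has to handle IOException even when the Appendable is a StringBuilder, which never actually throws it. The sketch below shows one way to do that. DebugSinkSketch and logLine are hypothetical names; logLine only mimics the kind of debug output this method writes.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration of writing debug text to an Appendable.
final class DebugSinkSketch
{
    private DebugSinkSketch() { }

    // Writes one line to any Appendable; the checked IOException comes from Appendable.append.
    static void logLine(Appendable out, String msg) throws IOException
    {
        out.append(msg).append('\n');
    }

    static String demo()
    {
        // StringBuilder implements Appendable, so it can collect the debug output in memory.
        StringBuilder sb = new StringBuilder();
        try
        {
            logLine(sb, "chin = ...");
        }
        catch (IOException e)
        {
            // Not expected for a StringBuilder, but the compiler still requires handling it.
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

Passing System.out or a java.io.Writer instead works the same way, since both also implement Appendable.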
        Code:
        Exact Method Body:
         int totalChinese = 0;
         DOUT.append("********************************************\n");
         DOUT.append("chin = " + simpSentence + "\n");
         DOUT.append("pron = " + pronSentence + "\n");
        
         // Remove "alternate" (AUC) versions of A..Z or 0..9, if any are there.
         String cSent = ZH.convertAnyAUC(simpSentence);
        
         // CHANGED 2018.09.24 - delAllPunctuation does not remove '.' and ',' between numbers!
         String pSent = ZH.delAllPunctuationPINYIN(pronSentence);
        
         cSent = ZH.delAllPunctuationCHINESE(cSent);
        
         DOUT.append("********************************************\n");
         DOUT.append("After Removing non-alphanumeric UniCode, and Alt-UniCode:\n");
         DOUT.append("cSent=" + cSent + "\n");
         DOUT.append("pSent=" + pSent + "\n");
         DOUT.append("********************************************\n");
        
         // Leading or trailing blanks mess this up
         // *** Use trim()
        
         String[] pWords = pSent.trim().split(" ");
        
         for (int i = 0; i < pWords.length; i++)
         {
             String pronWord = pWords[i].trim();
        
             if (pronWord.length() == 0) continue;
        
             // Sometimes alphabetic characters appear in the chinese string.
             int leading = ZH.countLeadingLettersAndNumbers(cSent.substring(totalChinese));
        
             if (leading > 0)
             {
                 String alphaNumericASCII = cSent.substring(totalChinese, totalChinese + leading);
        
                 DOUT.append("*** Found English and Numbers ASCII in Chinese Sentence ***\n");
                 DOUT.append("There are " + leading + " leading alpha numeric characters.");
                 DOUT.append(" [" + alphaNumericASCII + "]\n");
                 DOUT.append("pronunciation word is: [" + pronWord + "]\n");
        
                 pronunciation.add(pronWord);
                 characters.add(alphaNumericASCII);
        
                 totalChinese += leading;
             }
        
             // else - it's just normal characters in the chinese string
             else
             {
                 int numChinese      = ZH.countSyllablesAndNonChinese(pronWord, DOUT);
                 String chineseWord  = cSent.substring(totalChinese, totalChinese + numChinese);
        
                 DOUT.append("The word [" + pronWord + "] ");
                 DOUT.append("corresponds to " + numChinese + " Unicode Characters ");
                 DOUT.append("[" + chineseWord + "]\n");
        
                 // Add the new word to the list
                 pronunciation.add(pronWord);
                 characters.add(chineseWord);
        
                 totalChinese += numChinese;
             }
         }
        
         DOUT.append(
             "********************************************\n" +
             "COMPLETED SENTENCE LOOP\n" +
             "SUMMARY:\n" +
             "FOUND (" + totalChinese + ") characters in Chinese String\n" +
             "STRING CONTAINS (" + cSent.length() + ") characters\n" +
             ((totalChinese != cSent.length()) ? "\nPOSSIBLE ERROR MISMATCH\n\n" : "") +
             "********************************************\n"
         );
        
         return (cSent.length() != totalChinese) && (totalChinese > 0);
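The core idea of the loop above is that each pronunciation word consumes as many Chinese characters as it has syllables. The self-contained sketch below illustrates that idea under loud simplifying assumptions: it does not use the ZH helper class, its countSyllables is a naive stand-in (one syllable per run of vowels, tone-free ASCII pinyin, no apostrophes), it omits the branch for ASCII letters and digits inside the Chinese sentence, and it clamps substring bounds, which the method body above does not do. All names are hypothetical.

```java
import java.util.List;

// Hypothetical, simplified sketch of the alignment idea; not the library's implementation.
final class AlignSketch
{
    private AlignSketch() { }

    // Naive syllable count: each run of consecutive vowels counts as one syllable.
    // Assumption: tone-free ASCII pinyin, with 'v' standing in for u-umlaut.
    static int countSyllables(String pinyinWord)
    {
        int count = 0;
        boolean inVowelRun = false;
        for (char c : pinyinWord.toLowerCase().toCharArray())
        {
            boolean vowel = "aeiouv".indexOf(c) >= 0;
            if (vowel && !inVowelRun) count++;
            inVowelRun = vowel;
        }
        return count;
    }

    // Splits 'chinese' into words sized by the syllable counts of the pinyin words.
    // Returns true when the number of characters consumed does not match the sentence length.
    static boolean align
        (String chinese, String pinyin, List<String> characters, List<String> pronunciation)
    {
        int consumed = 0;
        int len = chinese.length();

        for (String word : pinyin.trim().split(" "))
        {
            if (word.isEmpty()) continue;   // repeated blanks yield empty tokens

            int n = countSyllables(word);

            // Clamp the bounds so a bad syllable count cannot throw (an addition in this sketch).
            int start = Math.min(consumed, len);
            int end   = Math.min(consumed + n, len);

            characters.add(chinese.substring(start, end));
            pronunciation.add(word);
            consumed += n;
        }

        return (len != consumed) && (consumed > 0);
    }
}
```

For the five-character sentence 我喜欢北京 and the pinyin "wo xihuan beijing", this sketch yields three words of 1, 2 and 2 characters and returns false. Dropping the last pinyin word leaves two characters unconsumed, so it returns true.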