java.lang.Object
- Torello.Languages.ZH

```
public class ZH
extends java.lang.Object
```
ZH (Mandarin Chinese): tools for parsing constructs from Mandarin News & other Web-Sites.

A series of simple Helper Routines for inspecting the special UTF-8 (non-Mandarin) characters often used in Mandarin HTML Web-Pages.
Hi-Lited Source-Code: Torello/Languages/ZH.java
File Size: 58,688 Bytes Line Count: 1,469 '\n' Characters Found
Stateless Class:
This class neither contains any program-state, nor can it be instantiated. The @StaticFunctional Annotation may also be called 'The Spaghetti Report'. Static-Functional classes are, essentially, C-Styled Files, without any constructors or non-static member fields. It is a concept very similar to the Java-Bean's @Stateless Annotation.
- 1 Constructor(s), 1 declared private, zero-argument constructor
- 35 Method(s), 35 declared static
- 7 Field(s), 7 declared static, 7 declared final

Field Summary

Fields
Modifier and Type	Field	Description
`static String`	`AUC`	The complete list of "higher-level" (alternate) Uni-Code chars.
`static char`	`CONSTSpecialQuoteLeft`	Special Quotation Mark, left-side
`static char`	`CONSTSpecialQuoteRight`	Special Quotation Mark, right-side

Method Summary

All Methods Static Methods Concrete Methods
Modifier and Type	Method	Description
`static char`	`alphaNumericAUC(char c)`	Alpha-Numeric character code from upper UniCode / UTF-8 These characters exist in UTF-8 - but they ARE NOT the usual ASCII characters for the letters `'A' ... 'Z'` or the numbers `'0' ... '9'` They, however, are sometimes found in documents on Chinese News Websites, etc.
`static char`	`bracketAUC(char c)`	Brackets - any version.
`static int`	`bulletListAUC(char c)`	Bullet List characters in upper `UniCode / UTF-8`.
`static char`	`commaAUC(char c)`	Comma - any version.
`static String`	`convertAnyAUC(String s)`	Checks for higher-Unicode letters and numbers, and converts them into lower-level versions of the appropriate letter or number.
`static int`	`countLeadingLettersAndNumbers(String chineseSentence)`	Checks for any leading alphabetic `('a' ... 'z')` and numeric `('0' ... '9')` characters in a Chinese `String`.
`static int`	`countSyllablesAndNonChinese(String word, Appendable DOUT)`	Counts syllables in a "word" of PinYin.
`static int`	`countToneVowels(String pinYinStr)`	Counts the number of tone vowels in a PinYin `String`.
`static String`	`delAllPunctuationCHINESE(String s)`	Deletes all punctuation & non-character symbols.
`static String`	`delAllPunctuationPINYIN(String s)`	Deletes all punctuation & non-character symbols from a `String` of PinYin.
`static char`	`endOfPhrase(char c)`	endOfPhrase - any version of the end-of-phrase markers usually used in Mandarin Chinese text.
`static char`	`endOfPhraseAUC(char c)`	Checks for end-of-phrase punctuation marks - and "down-converts" them to the simple ASCII equivalent version of that punctuation mark.
`static char`	`endOfSentence(char c)`	Checks for end-of-sentence punctuation marks.
`static char`	`endOfSentenceAUC(char c)`	Checks for end-of-sentence punctuation marks - and "down-converts" them to the simple ASCII equivalent version of that punctuation mark.
`static String`	`formatUTF8Chinese(char c)`	This is used to convert a Chinese Character into a full `String` that includes the UTF-8 code represented as a `HEXADECIMAL` number and a `decimal` number
`static int`	`GTPPEIndexOf(String s, char c)`	GTPPE: Google Translate Punctuation Pronunciation Equivalent This searches through a `String` to find the location of the "equivalent punctuation mark"
`static String`	`HTML2ChineseVowels(String s)`	Google Translate returns some text encoded as `"&#num;" (the "ord(c)").` This is also called `HTML Escaped Code` - because instead of actual ASCII/UTF8 characters themselves, their "Ord" are returned - surrounded by the usual HTML Escape Character Sequence &#num; This method does the `chr(html-hex-escape-code);` and replaces the `escape-sequence` (which again is &#NUM;) with the actual ASCII character.
`static String`	`HTML2UTF8(String s)`	NOTE: This does the same as `HTML2ChineseVowels(String)` *EXCEPT* that it converts *ANY* HTML string that has been encoded as: `&#NUM;` - not just the characters having accents and corresponding to Chinese Tone Vowels.
`static boolean`	`isAlpha(char c)`	Checks if a `char` is Alphabetic.
`static boolean`	`isAlphaNumeric(char c)`	Checks if a `char` is Alpha Numeric.
`static boolean`	`isBPMFAUC(char c)`	Bo Po Mo Fo (注音符號).
`static boolean`	`isChinese(char c)`	Helper function - checks if this is a character in the UTF-8 & ASCII ranges that contain Mandarin Chinese characters.
`static boolean`	`isNumber(char c)`	Regular Numbers Include: `'0' ... '9'`
`static boolean`	`isOther(char c)`	Checks whether a `char` is something other than `Alpha Numeric` or `White Space`
`static boolean`	`isRegLetter(char c)`	Regular Letters Include: `'A' ... 'Z'` (65 - 90), `'a' ... 'z'` (97 - 122)
`static boolean`	`isRegVowel(char c)`	Checks that a character is a standard vowel.
`static boolean`	`isSpace(char c)`	Checks for WhiteSpace: `'\t', '\n', '\r', ' '`
`static boolean`	`isToneVowel(char c)`	This is a helper function for the Mandarin Chinese accented vowel symbols in `UTF-8, ASCII` and `UniCode`.
`static char`	`parenAUC(char c)`	Parenthesis - any version.
`static char`	`punctuationAUC(char c)`	This method, `punctuationAUC(char)`, converts any characters which are common on many Mandarin Chinese websites into a lower-level, more typical/normal ASCII equivalent.
`static char`	`quoteAUC(char c)`	Quotes - any version.
`static String`	`testAUC()`
`static String`	`toneVowelsToRegularVowels(String s)`	This performs a conversion of all vowels in a `String` from those with tones over them to the normal (un-accented) equivalent.
`static char`	`toneVowelToRegularVowel(char c)`	This makes the problems of dealing with the tone/accent marks above vowels in Chinese Pin-Yin easier.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

AUC


public static final java.lang.String AUC

The complete list of "higher-level" (alternate) Uni-Code chars. Many of these are alternate punctuation marks used in documents that contain Mandarin Chinese.

See Also:

Constant Field Values

Code:

Exact Field Declaration Expression:

 public static final String AUC = 
         // Special Punctuation characters found in Chinese HTML Pages
         "、 。 · ˉ ˇ ¨ 〃 々 — ～ ‖ … ‘ ’ "             +
         "“ ” 〔 〕 〈 〉 《 》 「 」 『 』 〖 〗 【 】"	  +
         "± × ÷ ∶ ∧ ∨ ∑ ∏ ∪ ∩ ∈ ∷ √ ⊥ ∥ ∠"               +
         "⌒ ⊙ ∫ ∮ ≡ ≌ ≈ ∽ ∝ ≠ ≮ ≯ ≤ ≥ ∞ ∵ "           +
         "∴ ♂ ♀ ° ′ ″ ℃ ＄ ¤ ￠ ￡ ‰ § № ☆ ★"          +
         "○ ● ◎ ◇ ◆ □ ■ △ ▲ ※ → ← ↑ ↓ 〓 "            +
         "！ ＂ ＃ ￥ ％ ＆ ＇ （ ） ＊ ＋ ， － ． ／"      +

         // Extra Alphabetic and Numeric Characters sometimes used
         // on web-pages written in Chinese
         "０ １ ２ ３ ４ ５ ６ ７ ８ ９ ： ； ＜ ＝ ＞ ？"   +
         "＠ Ａ Ｂ Ｃ Ｄ Ｅ Ｆ Ｇ Ｈ Ｉ Ｊ Ｋ Ｌ Ｍ Ｎ Ｏ"   +
         "Ｐ Ｑ Ｒ Ｓ Ｔ Ｕ Ｖ Ｗ Ｘ Ｙ Ｚ ［ ＼ ］ ＾ ＿"   +
         "｀ ａ ｂ ｃ ｄ ｅ ｆ ｇ ｈ ｉ ｊ ｋ ｌ ｍ ｎ ｏ"   +
         "ｐ ｑ ｒ ｓ ｔ ｕ ｖ ｗ ｘ ｙ ｚ ｛ ｜ ｝ ￣"      +

         // Certain "Bullet List" / "Bullet Point" markers
         "⒈ ⒉ ⒊ ⒋ ⒌ ⒍ ⒎ ⒏ ⒐ ⒑ ⒒ ⒓ ⒔ ⒕ ⒖"      +
         "⒗ ⒘ ⒙ ⒚ ⒛ ⑴ ⑵ ⑶ ⑷ ⑸ ⑹ ⑺ ⑻ ⑼ ⑽ ⑾"   +
         "⑿ ⒀ ⒁ ⒂ ⒃ ⒄ ⒅ ⒆ ⒇ ① ② ③ ④ ⑤ ⑥ ⑦"         +
         "⑧ ⑨ ⑩ ㈠ ㈡ ㈢ ㈣ ㈤ ㈥ ㈦ ㈧ ㈨ ㈩"               +
         "Ⅰ Ⅱ Ⅲ Ⅳ Ⅴ Ⅵ Ⅶ Ⅷ Ⅸ Ⅹ Ⅺ Ⅻ"               +

         // The "Bo Po Mo Fo" Pronunciation Used for Chinese Characters
         "ㄐ ㄑ ㄒ ㄓ ㄔ ㄕ ㄖ ㄗ ㄘ ㄙ ㄚ ㄛ ㄜ ㄝ ㄞ ㄟ"   +
         "ㄠ ㄡ ㄢ ㄣ ㄤ ㄥ ㄦ ㄧ ㄨ ㄩ";

CONSTSpecialQuoteLeft

```
public static final char CONSTSpecialQuoteLeft
```
Special Quotation Mark, left-side
See Also:

Constant Field Values

Code:
Exact Field Declaration Expression:

public static final char CONSTSpecialQuoteLeft = (char) 0x201C;

CONSTSpecialQuoteRight

```
public static final char CONSTSpecialQuoteRight
```
Special Quotation Mark, right-side
See Also:

Constant Field Values

Code:
Exact Field Declaration Expression:

public static final char CONSTSpecialQuoteRight = (char) 0x201D;

Method Detail

toneVowelToRegularVowel

```
public static char toneVowelToRegularVowel(char c)
```
This makes the problems of dealing with the tone/accent marks above vowels in Chinese Pin-Yin easier. It converts a vowel with a tone mark over it into the regular vowel. This can be useful for certain String operations, although the tone information, and with it the original meaning of the word, is lost.
Parameters:

c - any character from ASCII / UTF-8 / UniCode Basic Multi Lingual Plane.

Returns:

if this is a UTF-8 character that is an accented vowel, the un-accented version of that vowel is returned. If this is not a PinYin symbol for a tone-vowel, ASCII 0 is returned.

See Also:

toneVowelsToRegularVowels(String)

Code:
Exact Method Body:

 for (int i=0; i < CV.length; i++)
     if (CV[i] == c)
         return CV2RV[i];

 return (char) 0;

countToneVowels


public static int countToneVowels(java.lang.String pinYinStr)

Counts the number of tone vowels in a PinYin String.

Parameters:

pinYinStr - A String, usually generated by Google Translate, (and scraped from Google Translate) that contains PinYin.

Returns:

The number of Mandarin Chinese Pin-Yin "Tone Vowels"

Code:

Exact Method Body:

 int count=0;

 TOP:
 for (int i = pinYinStr.length()-1; i >= 0; i--)
     for (int j=0; j < CV.length; j++)
         if (pinYinStr.charAt(i) == CV[j])
             { count++; continue TOP; }

 return count;
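
As a rough illustration of the counting loop above, the sketch below counts tone vowels in a short PinYin string. The private `CV` table is not shown on this page, so `TONE_VOWELS` is a hypothetical, partial stand-in that covers only the lower-case vowels a, e, i, o and u; the class name is also made up for the example.

```java
// Sketch only: TONE_VOWELS is a hypothetical, partial stand-in for the
// private CV table, which is not shown on this page.
class CountToneVowelsSketch
{
    static final String TONE_VOWELS =
        "\u0101\u00E1\u01CE\u00E0" +    // a with tone marks 1 to 4
        "\u0113\u00E9\u011B\u00E8" +    // e with tone marks 1 to 4
        "\u012B\u00ED\u01D0\u00EC" +    // i with tone marks 1 to 4
        "\u014D\u00F3\u01D2\u00F2" +    // o with tone marks 1 to 4
        "\u016B\u00FA\u01D4\u00F9";     // u with tone marks 1 to 4

    static int countToneVowels(String pinYinStr)
    {
        int count = 0;

        // A character counts once if it appears anywhere in the table
        for (int i = 0; i < pinYinStr.length(); i++)
            if (TONE_VOWELS.indexOf(pinYinStr.charAt(i)) >= 0) count++;

        return count;
    }

    public static void main(String[] args)
    {
        // "ni hao" written with third-tone marks on the i and the a
        System.out.println(countToneVowels("n\u01D0 h\u01CEo"));    // prints 2
    }
}
```

The plain `o` in the second syllable carries no tone mark, so it is not counted.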

toneVowelsToRegularVowels


public static java.lang.String toneVowelsToRegularVowels
            (java.lang.String s)

This performs a conversion of all vowels in a String from those with tones over them to the normal (un-accented) equivalent. It applies the single-character method of the similar name, toneVowelToRegularVowel(char), to each character.

Parameters:

s - any java.lang.String containing Mandarin Romanizations.

Returns:

a String with all accented vowels converted to regular vowels.

See Also:

toneVowelToRegularVowel(char)

Code:

Exact Method Body:

 int             strlen  = s.length();
 StringBuilder   sb      = new StringBuilder(s.length());
 char            c;

 for (int i=0; i < strlen; i++)
     if ((c = toneVowelToRegularVowel(s.charAt(i))) != 0)
         sb.append(c);
     else
         sb.append(s.charAt(i));

 return sb.toString();
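
A minimal sketch of the same per-character conversion, assuming a tiny lookup table. `TONE` and `PLAIN` are hypothetical parallel strings that stand in for the private `CV` / `CV2RV` arrays (not shown on this page) and cover only the tone-marked forms of `a` and `i`.

```java
// Sketch only: TONE and PLAIN are hypothetical stand-ins for the private
// CV / CV2RV tables; they cover just the tone-marked forms of 'a' and 'i'.
class StripTonesSketch
{
    static final String TONE  = "\u0101\u00E1\u01CE\u00E0\u012B\u00ED\u01D0\u00EC";
    static final String PLAIN = "aaaaiiii";

    // Returns the plain vowel, or ASCII 0 when 'c' is not in the table
    static char toneVowelToRegularVowel(char c)
    {
        int pos = TONE.indexOf(c);
        return (pos >= 0) ? PLAIN.charAt(pos) : (char) 0;
    }

    static String toneVowelsToRegularVowels(String s)
    {
        StringBuilder sb = new StringBuilder(s.length());

        for (int i = 0; i < s.length(); i++)
        {
            char plain = toneVowelToRegularVowel(s.charAt(i));
            sb.append((plain != 0) ? plain : s.charAt(i));
        }

        return sb.toString();
    }

    public static void main(String[] args)
    {
        System.out.println(toneVowelsToRegularVowels("n\u01D0 h\u01CEo"));  // prints "ni hao"
    }
}
```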

HTML2ChineseVowels

```
public static java.lang.String HTML2ChineseVowels(java.lang.String s)
```
Google Translate returns some text encoded as "&#num;" (the "ord(c)"). This is also called HTML Escaped Code: instead of the actual ASCII/UTF-8 characters themselves, their decimal character codes (their "Ord") are returned, surrounded by the usual HTML Escape Character Sequence &#num; This method performs the equivalent of chr(code) and replaces each escape-sequence (which again is &#NUM;) with the actual character.

NOTE: all of these are for "Chinese Tone Vowel" characters. The Google Translate module uses this method quite a bit. Here are a few examples of an HTML-Escape-Sequence and the corresponding character.

HTML-Escaped	ASCII/UTF-8 Character
&#192;	À
&#225;	á
&#283;	ě
&#363;	ū
&#474;	ǚ

... see array below for list

NOTE: HTML2UTF8(String) does the exact same thing, but does not limit the characters to be converted to only Chinese Tone Vowels. HTML2ChineseVowels only converts HTML-Escaped-Characters from this list:

private static final int[] H2CV = { 39, 192, 201, 224, 225, 232, 233, 236, 237, 242, 243, 249, 250, 252, 256, 257, 275, 283, 299, 333, 363, 462, 464, 466, 468, 474, 476 };
See Also:

HTML2UTF8(String)

Code:
Exact Method Body:

 for (int i=0; i < H2CV.length; i++)
     s = s.replaceAll("&#" + H2CV[i] + ";", "" + (char) H2CV[i]);

 return s;
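
The sketch below shows the effect of this list-limited conversion on a short string. It copies the `H2CV` values listed above, but the class name is hypothetical and it uses the literal `String.replace` rather than the regex-based `replaceAll` of the method body, purely to keep the example simple.

```java
// Sketch only: the class name is hypothetical; H2CV is copied from the list above.
class Html2ChineseVowelsSketch
{
    static final int[] H2CV =
    {
        39, 192, 201, 224, 225, 232, 233, 236, 237, 242, 243, 249, 250, 252,
        256, 257, 275, 283, 299, 333, 363, 462, 464, 466, 468, 474, 476
    };

    static String html2ChineseVowels(String s)
    {
        // Replace each listed escape, e.g. "&#464;", by the character with that code
        for (int code : H2CV)
            s = s.replace("&#" + code + ";", String.valueOf((char) code));

        return s;
    }

    public static void main(String[] args)
    {
        // 464 and 462 are in the list; 65 (the letter 'A') is not
        System.out.println(html2ChineseVowels("n&#464; h&#462;o &#65;"));
    }
}
```

Escapes whose code is not in the list, such as `&#65;`, are left untouched.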

HTML2UTF8


public static java.lang.String HTML2UTF8(java.lang.String s)

NOTE: This does the same as HTML2ChineseVowels(String) EXCEPT that it converts ANY HTML string that has been encoded as: &#NUM; - not just the characters having accents and corresponding to Chinese Tone Vowels.

See Also:

HTML2ChineseVowels(String)

Code:

Exact Method Body:

 // Build the list of UTF8/ASCII character values (as Ord(c) / int) first.
 HashSet<Integer>    utfList = new HashSet<Integer>();
 Matcher             m       = P1.matcher(s);

 while (m.find()) utfList.add(Integer.parseInt(m.group(1)));

 // Now convert them.
 for (Integer i : utfList) s = s.replaceAll("&#" + i.toString() + ";", "" + ((char) i.intValue()));

 return s;
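
A hedged sketch of the same general idea follows. The private pattern `P1` is not shown on this page, so the regular expression here is an assumption, and the class name is hypothetical. The sketch also swaps in `Matcher.appendReplacement` with `Matcher.quoteReplacement` instead of the `HashSet` plus `replaceAll` approach of the method body, so that a decoded `$` or `\` is inserted literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: the class name and the pattern are assumptions, since the
// private pattern P1 of the real class is not shown on this page.
class Html2Utf8Sketch
{
    // Decimal escapes such as "&#20013;" with at most five digits
    static final Pattern ESCAPE = Pattern.compile("&#(\\d{1,5});");

    static String html2Utf8(String s)
    {
        Matcher      m  = ESCAPE.matcher(s);
        StringBuffer sb = new StringBuffer();

        while (m.find())
        {
            // Values above 65535 do not fit in a char and are truncated by the cast
            char c = (char) Integer.parseInt(m.group(1));

            // quoteReplacement keeps '$' and '\' from being read as group references
            m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(c)));
        }

        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args)
    {
        System.out.println(html2Utf8("&#20013;&#25991; &#36;1"));
    }
}
```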

formatUTF8Chinese

```
public static java.lang.String formatUTF8Chinese(char c)
```
This is used to convert a Chinese Character into a full String that includes the UTF-8 code represented as a HEXADECIMAL number and a decimal number
Parameters:

c - any ASCII/UniCode/UTF-8 char - but, generally, expected to be a "Chinese Character."

NOTE: The choice for parameter char c has no actual constraints on its input value.

Returns:

A String of this format: 掭(0x63AD, 25517)

Code:
Exact Method Body:

return c + "(0x" + String.format("%x", ((int) c)).toUpperCase() + ", " + ((int) c) + ")";
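
For illustration, the same output format can be produced with a single format string. This is a sketch with a hypothetical class and method name, not the library's own code.

```java
// Sketch only: class and method names are hypothetical.
class FormatCharSketch
{
    static String formatChar(char c)
    {
        // "%X" prints upper-case hexadecimal, "%d" prints decimal
        return c + String.format("(0x%X, %d)", (int) c, (int) c);
    }

    public static void main(String[] args)
    {
        System.out.println(formatChar('A'));        // prints "A(0x41, 65)"
        System.out.println(formatChar('\u63AD'));   // the character from the example above
    }
}
```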

isChinese

```
public static boolean isChinese(char c)
```
Helper function - checks if this is a character in the UTF-8 & ASCII ranges that contain Mandarin Chinese characters. This is not guaranteed to be accurate - some non-Chinese Japanese characters exist in this range. For the precise definition of what this function actually does, see the ranges printed below.

COPIED FROM***
http://www.khngai.com/chinese/charmap/tbluni.php?page=0

AND: ((c >= 0x4E00) && (c <= 0x9FFF))

COPIED FROM***
http://www.khngai.com/chinese/charmap/tblgb.php?page=1
Parameters:

c - any UTF-8, ASCII or UniCode character available.

Returns:

TRUE if the input character 'c' is in the UTF-8/UniCode range for Chinese Characters

Code:
Exact Method Body:

 if ((c >= 0x4E00) && (c <= 0x9FFF)) return true;
 if ((c >= 0xB0A0) && (c <= 0xBFFF)) return true;
 if ((c >= 0xC0A0) && (c <= 0xCFFF)) return true;
 if ((c >= 0xD0A0) && (c <= 0xDFFF)) return true;
 if ((c >= 0xE0A0) && (c <= 0xEFFF)) return true;
 if ((c >= 0xF0A0) && (c <= 0xF7FF)) return true;

 return false;
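
The first range in the method body, 0x4E00 to 0x9FFF, coincides with the Unicode block "CJK Unified Ideographs". The sketch below covers only that first range and compares it with the standard `Character.UnicodeBlock` lookup; the class and method names are hypothetical, and the remaining ranges of the method are deliberately left out.

```java
// Sketch only: covers just the first range of the method above.
class CjkRangeSketch
{
    // Same comparison as the first 'if' of the method body
    static boolean inMainCjkRange(char c)
    { return (c >= 0x4E00) && (c <= 0x9FFF); }

    // Equivalent test using the standard library's block table
    static boolean inCjkBlock(char c)
    { return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS; }

    public static void main(String[] args)
    {
        char han      = '\u4E2D';   // a common Chinese character
        char hiragana = '\u3042';   // a Japanese kana, outside the range

        System.out.println(inMainCjkRange(han)      + " " + inCjkBlock(han));       // prints "true true"
        System.out.println(inMainCjkRange(hiragana) + " " + inCjkBlock(hiragana));  // prints "false false"
    }
}
```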

isOther

```
public static boolean isOther(char c)
```
Checks whether a char is something other than Alpha Numeric or White Space
Parameters:

c - any UTF-8, ASCII or UniCode character available.

Returns:

((!isAlphaNumeric(c)) && (!isSpace(c)));

Code:
Exact Method Body:

return ((!isAlphaNumeric(c)) && (!isSpace(c)));

isAlphaNumeric

```
public static boolean isAlphaNumeric(char c)
```
Checks if a char is Alpha Numeric.
Parameters:

c - any UTF-8, ASCII or UniCode character available.

Returns:

(isAlpha(c) || isNumber(c));

Code:
Exact Method Body:

return (isAlpha(c) || isNumber(c));

isAlpha

```
public static boolean isAlpha(char c)
```
Checks if a char is Alphabetic.
Parameters:

c - any UTF-8, ASCII or UniCode character available.

Returns:

(isToneVowel(c) || isRegVowel(c) || isRegLetter(c));

Code:
Exact Method Body:

return (isToneVowel(c) || isRegVowel(c) || isRegLetter(c));

isToneVowel


public static boolean isToneVowel(char c)

This is a helper function for the Mandarin Chinese accented vowel symbols in UTF-8, ASCII and UniCode. The exact character code numbers are printed below.

NOTE: In 罗马拼音 (Pin-Yin Romanization), there are a few symbols that should never come up, at least as far as the software pertains to 罗马拼音 results provided by the Google Cloud Server Translation API (GCS-TS/TAPI). This is because no word in Pin-Yin ever starts with the letters I or U, or with U with an umlaut, so capitalized versions of these letters ought never to occur unless the entire PinYin were capitalized, which is something GCSTS never does.

Parameters:

c - any UTF-8, ASCII or UniCode character available.

Returns:

TRUE if the input character 'c' is one of the following:

Simple ASCII	UTF-8 Tone Vowel
a	ā (257), á (225), ǎ (462), à (224)
e	ē (275), é (233), ě (283), è (232)
i	ī (299), í (237), ǐ (464), ì (236)
o	ō (333), ó (243), ǒ (466), ò (242)
u	ū (363), ú (250), ǔ (468), ù (249)
u	ǖ (470), ǘ (472), ǚ (474), ǜ (476)
A	Ā (256), Á (193), Ǎ (461), À (192)
E	Ē (274), É (201), Ě (282), È (200)
O	Ō (332), Ó (211), Ǒ (465), Ò (210)

In Mandarin Chinese, PinYin-words cannot start with these letters below. Therefore it would be highly unlikely to see a "capitalized" version of these tone-vowels.

Simple ASCII	UTF-8 Tone Vowel
I	Ī (298), Í (205), (there are 2: Ǐ (463), Ĭ (300)), Ì (204)
U	Ū (362), Ú (218), Ŭ (364), Ù (217)
U	(Ü (220) -no tone): Ǖ (469), Ǘ (471), Ǚ (473), Ǜ (475)

Code:

Exact Method Body:

 // A, ā 257, á 225, ǎ 462, à 224
 if ((c == 257) || (c == 225) || (c == 462) || (c == 224)) return true;

 // E, ē 275, é 233, ě 283, è 232
 if ((c == 275) || (c == 233) || (c == 283) || (c == 232)) return true;
              
 // I, ī 299, í 237, ǐ 464, ì 236 
 if ((c == 299) || (c == 237) || (c == 464) || (c == 236)) return true;

 // O, ō 333, ó 243, ǒ	466, ò 242
 if ((c == 333) || (c == 243) || (c == 466) || (c == 242)) return true;

 // U, ū 363, ú 250, ǔ 468, ù 249
 if ((c == 363) || (c == 250) || (c == 468) || (c == 249)) return true;

 // U, ǖ 470, ǘ 472, ǚ 474, ǜ 476
 if ((c == 470) || (c == 472) || (c == 474) || (c == 476)) return true;

 // *******
 // Capital vowels with tone symbols

 // Ā 256, Á 193, Ǎ 461, À 192
 if ((c == 256) || (c == 193) || (c == 461) || (c == 192)) return true;

 // Ē 274, É 201, Ě 282, È 200
 if ((c == 274) || (c == 201) || (c == 282) || (c == 200)) return true;

 // Ō 332, Ó 211, Ǒ 465, Ò 210
 if ((c == 332) || (c == 211) || (c == 465) || (c == 210)) return true;

 // Not sure about these - found them on a website
 // **********************************************
 //       1234 5678 9ABC DEF
 // A8A0  āáǎà ēéěè  īíǐì ōóǒ
 //
 //       0 1234 5678 9 A
 // A8B0  ò ūúǔù  ǖǘǚǜ  ü ê
 // **********************************************
 if ((c >= 0xA8A1) && (c <= 0xA8Ba)) return true;

 return false;

isRegVowel


public static boolean isRegVowel(char c)

Checks that a character is a standard vowel.

Parameters:

c - any UTF-8, ASCII or UniCode character available.

Returns:

TRUE if the input character 'c' EQUALS one of these ten letters: a, e, i, o, u, A, E, I, O, U

Code:

Exact Method Body:

 // The normal vowels

 // a 97, A 65
 if ((c == 97) || (c == 65))     return true;

 // e 101, E 69
 if ((c == 101) || (c == 69))    return true;

 // i 105, I 73
 if ((c == 105) || (c == 73))    return true;

 // o 111, O 79
 if ((c == 111) || (c == 79))    return true;

 // u 117, U 85
 if ((c == 117) || (c == 85))    return true;

 return false;

isRegLetter

```
public static boolean isRegLetter(char c)
```
Regular Letters Include: 'A' ... 'Z' (65 - 90), 'a' ... 'z' (97 - 122)
Parameters:

c - any UTF-8, ASCII or UniCode character available.

Returns:

TRUE if the input character 'c' is any letter in lower-level ASCII (and not any of the AUC).

Code:
Exact Method Body:

return ((c >= 65) && (c <= 90)) || ((c >= 97) && (c <= 122));

isNumber

```
public static boolean isNumber(char c)
```
Regular Numbers Include: '0' ... '9'
Parameters:

c - any UTF-8, ASCII or UniCode character available.

Returns:

TRUE if the input character 'c' is in the range of ASCII '0' ... '9' (not any of the AUC)

Code:
Exact Method Body:

return ((c >= 48) && (c <= 57));
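
The "not any of the AUC" remark can be made concrete with a short sketch. The class name is hypothetical; the two methods repeat the comparisons shown above using character literals, and the contrast with `Character.isDigit` is included only to show how the standard library treats a fullwidth digit.

```java
// Sketch only: the class name is hypothetical.
class AsciiClassSketch
{
    static boolean isRegLetter(char c)
    { return ((c >= 'A') && (c <= 'Z')) || ((c >= 'a') && (c <= 'z')); }

    static boolean isNumber(char c)
    { return (c >= '0') && (c <= '9'); }

    public static void main(String[] args)
    {
        char fullwidthSeven = '\uFF17';     // fullwidth digit seven, one of the AUC

        System.out.println(isNumber('7'));                          // prints "true"
        System.out.println(isNumber(fullwidthSeven));               // prints "false"

        // The standard library also accepts fullwidth digits as digits
        System.out.println(Character.isDigit(fullwidthSeven));      // prints "true"
    }
}
```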

isSpace

```
public static boolean isSpace(char c)
```
Checks for WhiteSpace: '\t', '\n', '\r', ' '
Parameters:

c - any UTF-8, ASCII or UniCode character available.

Returns:

TRUE if the input character 'c' is a whitespace character code from the above list

Code:
Exact Method Body:

 // '\t' is 9, '\n' is 10, '\r' is 13, ' ' is 32
 return ((c == 9) || (c == 10) || (c == 13) || (c == 32));

bulletListAUC

```
public static int bulletListAUC(char c)
```
Bullet List characters in upper UniCode / UTF-8. These characters exist in UTF-8 - and they are occasionally used in documents found on Chinese News Websites. They are all "bullet-list" points. An integer is returned for each of these, that is equal to the number represented by the UTF-8/UniCode character here.
- 0 1 2 3 4 5 6 7 8 9 a b c d e f
- N ⒈ ⒉ ⒊ ⒋ ⒌ ⒍ ⒎ ⒏ ⒐ ⒑ ⒒ ⒓ ⒔ ⒕ ⒖
- ⒗ ⒘ ⒙ ⒚ ⒛ ⑴ ⑵ ⑶ ⑷ ⑸ ⑹ ⑺ ⑻ ⑼ ⑽ ⑾
- ⑿ ⒀ ⒁ ⒂ ⒃ ⒄ ⒅ ⒆ ⒇ ① ② ③ ④ ⑤ ⑥ ⑦
- ⑧ ⑨ ⑩ N N ㈠㈡㈢㈣㈤㈥㈦㈧㈨㈩ N
- N Ⅰ Ⅱ Ⅲ Ⅳ Ⅴ Ⅵ Ⅶ Ⅷ Ⅸ Ⅹ Ⅺ Ⅻ
Parameters:

c - any character as input

Returns:

The number equivalent represented by this bullet point.

Code:
Exact Method Body:

 // ⒈ ==> ⒛
 if ((c >= 0x2488) && (c <= 0x249B)) return ((int) c) - 0x2487;

 // ⑴ ==> ⒇
 if ((c >= 0x2474) && (c <= 0x2487)) return ((int) c) - 0x2473;

 // ① ==> ⑩
 if ((c >= 0x2460) && (c <= 0x2469)) return ((int) c) - 0x245F;

 // ㈠ ==> ㈩
 if ((c >= 0x3220) && (c <= 0x3229)) return ((int) c) - 0x321F;

 // Ⅰ ==> Ⅻ
 if ((c >= 0x2160) && (c <= 0x216B)) return ((int) c) - 0x215F;

 return 0;
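
The offset arithmetic can also be written as a small range table, which may make the "subtract the start, add one" pattern easier to see. This is a sketch with hypothetical names; the ranges are the five listed in the method body.

```java
// Sketch only: class, field and method names are hypothetical.
class BulletNumberSketch
{
    // First and last character code of each bullet series
    static final int[][] RANGES =
    {
        { 0x2488, 0x249B },     // digit followed by full stop, 1 to 20
        { 0x2474, 0x2487 },     // parenthesized number, 1 to 20
        { 0x2460, 0x2469 },     // circled number, 1 to 10
        { 0x3220, 0x3229 },     // parenthesized ideograph, 1 to 10
        { 0x2160, 0x216B }      // Roman numeral, 1 to 12
    };

    static int bulletNumber(char c)
    {
        // Position inside the series, counted from one
        for (int[] r : RANGES)
            if ((c >= r[0]) && (c <= r[1])) return c - r[0] + 1;

        return 0;   // not a bullet character
    }

    public static void main(String[] args)
    {
        System.out.println(bulletNumber('\u2462'));     // circled three, prints 3
        System.out.println(bulletNumber('\u216B'));     // Roman numeral twelve, prints 12
        System.out.println(bulletNumber('x'));          // prints 0
    }
}
```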

alphaNumericAUC

```
public static char alphaNumericAUC(char c)
```
Alpha-Numeric character code from upper UniCode / UTF-8

These characters exist in UTF-8 - but they ARE NOT the usual ASCII characters for the letters 'A' ... 'Z' or the numbers '0' ... '9' They, however, are sometimes found in documents on Chinese News Websites, etc.

Copied from:
http://www.khngai.com/chinese/charmap/tblgb.php?page=0
- 0 1 2 3 4 5 6 7 8 9 a b c d e f
- ！＂＃￥％＆＇（）＊＋，－．／
- ０１２３４５６７８９：；＜＝＞？
- ＠ＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯ
- ＰＱＲＳＴＵＶＷＸＹＺ［＼］＾＿
- ａｂｃｄｅｆｇｈｉｊｋｌｍｎｏ
- ｐｑｒｓｔｕｖｗｘｙｚ｛｜｝￣
Parameters:

c - any character as input

Returns:

the "lower-level-ASCII" version of that character.

Code:
Exact Method Body:

 // ASCII 'A' is 65
 if ((c > 0xFF20) && (c < 0xFF3B)) return (char) (65 + (c - 0xFF21));

 // ASCII 'a' is 97
 if ((c > 0xFF40) && (c < 0xFF5B)) return (char) (97 + (c - 0xFF41));

 // ASCII '0' is 48
 if ((c >= 0xFF10) && (c <= 0xFF1A)) return (char) (48 + (c - 0xFF10));

 return 0;
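
All three comparisons amount to subtracting the same constant, 0xFEE0, from a fullwidth character. The sketch below applies that to a whole string. The names are hypothetical, and unlike the method above it passes other characters through unchanged instead of returning 0, and its digit range stops at 0xFF19.

```java
// Sketch only: names are hypothetical; non-fullwidth characters are passed
// through unchanged, which differs from the ASCII-0 convention used above.
class FullwidthSketch
{
    // Distance between a fullwidth letter or digit and its ASCII counterpart
    static final int OFFSET = 0xFEE0;

    static char narrow(char c)
    {
        boolean upper = (c >= 0xFF21) && (c <= 0xFF3A);
        boolean lower = (c >= 0xFF41) && (c <= 0xFF5A);
        boolean digit = (c >= 0xFF10) && (c <= 0xFF19);

        return (upper || lower || digit) ? (char) (c - OFFSET) : c;
    }

    static String narrow(String s)
    {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) sb.append(narrow(s.charAt(i)));
        return sb.toString();
    }

    public static void main(String[] args)
    {
        // Fullwidth 'A', fullwidth 'b', fullwidth '3', then plain ASCII
        System.out.println(narrow("\uFF21\uFF42\uFF13-x"));     // prints "Ab3-x"
    }
}
```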

punctuationAUC


public static char punctuationAUC(char c)

This method, punctuationAUC(char), converts any characters which are common on many Mandarin Chinese websites into a lower-level, more typical/normal ASCII equivalent. This can be very useful when trying to make sense of brackets, parentheses, quotes, commas and other punctuation marks, and to quickly convert them into a simple version of the character.

If the input character has an "Alternate Version" in the lower-level-ASCII range, that lower level ASCII character is returned. If this isn't AUC, ASCII-0 is returned.

For Instance:

Input	Output
〖〗【】	[ ] [ ]
。 ○ ● ．	. (ASCII-period)
¨ 〃 “ ” ″ ＂	" (ASCII-double-quote)
, (ASCII-comma)	ASCII-0
+ (ASCII-plus)	ASCII-0

Parameters:

c - any character as input

Returns:

the "lower-level-ASCII" version of that character

NOTE: ASCII-0 is returned if this is not a valid "AUC" UTF-8 / UniCode code!

Code:

Exact Method Body:

 // Copied from: 
 // *** http://www.khngai.com/chinese/charmap/tblgb.php?page=0
 //
 // 0 1 2 3 4 5 6 7 8 9 a b c d e f
 // N N 、 。 · ˉ ˇ ¨ 〃 々 — ～ ‖ … ‘ ’ 
 // “ ” 〔 〕 〈 〉 《 》 「 」 『 』 〖 〗 【 】
 // ± × ÷ ∶ ∧ ∨ ∑ ∏ ∪ ∩ ∈ ∷ √ ⊥ ∥ ∠
 // ⌒ ⊙ ∫ ∮ ≡ ≌ ≈ ∽ ∝ ≠ ≮ ≯ ≤ ≥ ∞ ∵ 
 // ∴ ♂ ♀ ° ′ ″ ℃ ＄ ¤ ￠ ￡ ‰ § № ☆ ★
 // ○ ● ◎ ◇ ◆ □ ■ △ ▲ ※ → ← ↑ ↓ 〓 
 //
 // 0 1 2 3 4 5 6 7 8 9 a b c d e f
 // ！ ＂ ＃ ￥ ％ ＆ ＇ （ ） ＊ ＋ ， － ． ／
 // ０ １ ２ ３ ４ ５ ６ ７ ８ ９ ： ； ＜ ＝ ＞ ？
 // ＠ Ａ Ｂ Ｃ Ｄ Ｅ Ｆ Ｇ Ｈ Ｉ Ｊ Ｋ Ｌ Ｍ Ｎ Ｏ
 // Ｐ Ｑ Ｒ Ｓ Ｔ Ｕ Ｖ Ｗ Ｘ Ｙ Ｚ ［ ＼ ］ ＾ ＿
 // ｀ ａ ｂ ｃ ｄ ｅ ｆ ｇ ｈ ｉ ｊ ｋ ｌ ｍ ｎ ｏ
 // ｐ ｑ ｒ ｓ ｔ ｕ ｖ ｗ ｘ ｙ ｚ ｛ ｜ ｝ ￣	 

 switch (c)
 {
     // 、 ，
     case 0x3001:               // 、
     case 0xFF0C: return ',';   // ，

     // 。 ○ ● ．
     case 0x3002:               // 。
     case 0x25CB:               // ○
     case 0x25CF:               // ●
     case 0xFF0E: return '.';   // ．

     // ‘ ’ ′ ＇ ｀
     case 0x2018:               // ‘
     case 0x2019:               // ’
     case 0x2032:               // ′
     case 0xFF07:               // ＇
     case 0xFF40: return '\'';  // ｀

     // ¨ 〃 “ ” ″ ＂
     case 0x00A8:               // ¨
     case 0x3003:               // 〃
     case 0x201C:               // “
     case 0x201D:               // ”
     case 0x2033:               // ″
     case 0xFF02: return '\"';  // ＂

     // 〔 （
     case 0x3014:               // 〔
     case 0xFF08: return '(';   // （

     // 〕 ）
     case 0x3015:               // 〕
     case 0xFF09: return ')';   // ）

     // 〈 ＜
     case 0x3008:               // 〈
     case 0xFF1C: return '<';   // ＜

     // 〉 ＞
     case 0x3009:               // 〉
     case 0xFF1E: return '>';   // ＞

     // 「 『 〖 【 ［
     case 0x300C:               // 「
     case 0x300E:               // 『
     case 0x3016:               // 〖
     case 0x3010:               // 【
     case 0xFF3B: return '[';   // ［

     // 」 』 〗】 ］
     case 0x300D:               // 」
     case 0x300F:               // 』
     case 0x3017:               // 〗
     case 0x3011:               // 】
     case 0xFF3D: return ']';   // ］

     // ∶ ：
     case 0x2236:               // ∶
     case 0xFF1A: return ':';   // ：

     case 0xFF01: return '!';   // ！
     case 0xFF03: return '#';   // ＃
     case 0xFF05: return '%';   // ％
     case 0xFF06: return '&';   // ＆
     case 0xFF1F: return '?';   // ？
     case 0xFF0F: return '/';   // ／
     case 0xFF3E: return '^';   // ＾
     case 0xFF5B: return '{';   // ｛
     case 0xFF5D: return '}';   // ｝
     case 0xFF5C: return '|';   // ｜
     case 0xFF0B: return '+';   // ＋
     case 0xFF3C: return '\\';  // ＼
     case 0xFF3F: return '_';   // ＿

     // — －
     case 0x2014:               // —
     case 0xFF0D: return '-';   // －

     // 〓 ＝
     case 0x3013:               // 〓
     case 0xFF1D: return '=';   // ＝
 }
 return 0;
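
A typical use of such a mapping is to normalize a whole string, keeping any character that has no ASCII equivalent. The sketch below assumes that usage; its names are hypothetical and its `switch` repeats only a handful of the cases from the method body above.

```java
// Sketch only: names are hypothetical; the switch is a small subset of the
// cases listed in the method body above.
class NormalizePunctuationSketch
{
    static char punctuationSubset(char c)
    {
        switch (c)
        {
            case 0x3001:                // ideographic comma
            case 0xFF0C: return ',';    // fullwidth comma
            case 0x3002: return '.';    // ideographic full stop
            case 0x201C:                // left double quotation mark
            case 0x201D: return '"';    // right double quotation mark
            case 0x3010: return '[';    // left black lenticular bracket
            case 0x3011: return ']';    // right black lenticular bracket
            default:     return 0;      // no ASCII equivalent in this subset
        }
    }

    static String normalize(String s)
    {
        StringBuilder sb = new StringBuilder(s.length());

        for (int i = 0; i < s.length(); i++)
        {
            char c = s.charAt(i);
            char p = punctuationSubset(c);
            sb.append((p != 0) ? p : c);    // keep characters that have no mapping
        }

        return sb.toString();
    }

    public static void main(String[] args)
    {
        System.out.println(normalize("\u3010A\u3011\uFF0CB\u3002"));    // prints "[A],B."
    }
}
```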

isBPMFAUC

```
public static boolean isBPMFAUC(char c)
```
Bo Po Mo Fo (注音符號).

This is a popular pronunciation system for Mandarin Characters in Taiwan & Hong Kong.
- N N N N N ㄅㄆㄇㄈㄉㄊㄋㄌㄍㄎㄏ
- ㄐㄑㄒㄓㄔㄕㄖㄗㄘㄙㄚㄛㄜㄝㄞㄟ
- ㄠㄡㄢㄣㄤㄥㄦㄧㄨㄩ N N N N N N
Parameters:

c - any UTF-8, ASCII or UniCode character available from Plane 0, the Basic Multi-Lingual Plane

Returns:

TRUE if the input character 'c' is in this UTF-8/UniCode range. The HEXADECIMAL / UTF-8 representation of the 'Bo Po Mo Fo' range is: 0x3110 ... 0x3129.

Code:
Exact Method Body:

 // 0 1 2 3 4 5 6 7 8 9 a b c d e f
 // N N N N N ㄅㄆㄇㄈㄉㄊㄋㄌㄍㄎㄏ
 // ㄐㄑㄒㄓㄔㄕㄖㄗㄘㄙㄚㄛㄜㄝㄞㄟ
 // ㄠㄡㄢㄣㄤㄥㄦㄧㄨㄩ N N N N N N
 return (c >= 0x3110) && (c <= 0x3129);

endOfSentenceAUC

```
public static char endOfSentenceAUC(char c)
```
Checks for end-of-sentence punctuation marks - and "down-converts" them to the simple ASCII equivalent version of that punctuation mark. If the input character code is not an AUC version of a typical Mandarin-Chinese end-of-sentence punctuation mark - then ASCII-zero is returned.

NOTE: if a lower-level-ASCII (normal) punctuation mark is input - then ASCII-0 is returned.

SPECIFICALLY: with '.' '?' and '!' as input to this function, ASCII-0 will be returned.

USE: endOfSentence(c) to have those punctuation marks included in non-zero results.
Parameters:

c - any UTF-8, ASCII or UniCode character available.

Returns:
if the input character 'c' is an "alternate UTF-8" version of the punctuation marks:

- a period ('.')
- an exclamation-point ('!')
- a question-mark ('?')

Then the output to this method shall be determined by the table below:

Input Character	Output Character
。 ○ ● ．	'.' (normal period)
！	'!' (regular exclamation point)
？	'?' (usual question mark)

NOTE: If the normal period, question, or exclamation are passed as input to this function, this function will return ASCII-0
See Also:

endOfSentence(char)

Code:
Exact Method Body:

 char auc = punctuationAUC(c);

 if (auc != 0) c = auc;

 // A 'switch' is used instead of an 'if' with a char-cast because it is easier to
 // read on this page.  Only the three characters with ASCII 46, 33, and 63 should
 // return non-zero values.
 switch ((int) auc)
 {
     // These characters identify an "End of Sentence" marker.
     case 0x2E: return '.';   // DEC: 46
     case 0x21: return '!';   // DEC: 33
     case 0x3F: return '?';   // DEC: 63

     // All other characters should result in a '0'
     default: return (char) 0;
 }

endOfSentence

```
public static char endOfSentence(char c)
```
Checks for end-of-sentence punctuation marks. This Helper function is *almost* identical to the endOfSentenceAUC(c) method.

endOfSentenceAUC(c) returns ASCII-0 for the usual-punctuation marks - '.', '!' and '?'.

endOfSentence(c) does not 'leave-out' or 'deny' these lower-level-ASCII punctuation symbols.
Parameters:

c - any UTF-8, ASCII or UniCode character available.

Returns:

If the input character 'c' is a period ('.'), an exclamation-point ('!'), or a question-mark ('?') - or an AUC version of that punctuation, then that punctuation is returned. Otherwise ASCII-0 is returned.

See Also:

endOfSentenceAUC(char)

Code:
Exact Method Body:

 char auc = endOfSentenceAUC(c);

 if (auc != 0) c = auc;

 // These three characters identify an "End of Sentence" Marker
 if ((c == '.') || (c == '!') || (c == '?')) return c;

 return (char) 0;
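
One plausible use of an end-of-sentence test is counting sentence terminators in mixed Chinese and ASCII text. The sketch below assumes that use; the names are hypothetical, and the `switch` handles only the fullwidth and ideographic terminators plus their ASCII forms (the circle characters from the table above are omitted).

```java
// Sketch only: names are hypothetical; the circle characters mapped to a
// period by the real class are left out to keep the example short.
class SentenceEndSketch
{
    static char endOfSentence(char c)
    {
        switch (c)
        {
            case 0x3002:                // ideographic full stop
            case 0xFF0E:                // fullwidth full stop
            case '.':    return '.';
            case 0xFF01:                // fullwidth exclamation mark
            case '!':    return '!';
            case 0xFF1F:                // fullwidth question mark
            case '?':    return '?';
            default:     return 0;
        }
    }

    static int countSentenceEnds(String s)
    {
        int count = 0;
        for (int i = 0; i < s.length(); i++)
            if (endOfSentence(s.charAt(i)) != 0) count++;
        return count;
    }

    public static void main(String[] args)
    {
        // Two Chinese sentences (ideographic stop, fullwidth question mark), then "OK!"
        String text = "\u4F60\u597D\u3002\u4F60\u597D\u5417\uFF1FOK!";
        System.out.println(countSentenceEnds(text));    // prints 3
    }
}
```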

endOfPhraseAUC


public static char endOfPhraseAUC(char c)

Checks for end-of-phrase punctuation marks - and "down-converts" them to the simple ASCII equivalent version of that punctuation mark. If the input character code is not an AUC version of a typical Mandarin-Chinese phrase-delimiting punctuation mark - then ASCII-zero is returned.

NOTE: if a lower-level-ASCII (normal) punctuation mark is input - then ASCII-0 is returned.

SPECIFICALLY: with ',' ':' ';' and other common phrase-ending marks in Mandarin as input to this function, ASCII-0 will be returned.

USE: endOfPhrase(c) to have those punctuation marks included in non-zero results.

Parameters:

c - any UTF-8, ASCII or UniCode character available.

Returns:

if the input character 'c' is an "alternate UTF-8" (AUC) version of the punctuation marks:

Punctuation	Symbol and ASCII-Code
semi-colon	';' HEX:0x3B, DEC: 59
comma	',' HEX:0x2C, DEC: 44
colon	':' HEX:0x3A, DEC: 58
double-quote	'\"' HEX:0x22, DEC: 34
single-quote	'\'' HEX:0x27, DEC: 39
left-bracket	'[' HEX:0x5B, DEC: 91
right-bracket	']' HEX:0x5D, DEC: 93
less-than	'<' HEX:0x3C, DEC: 60
greater-than	'>' HEX:0x3E, DEC: 62
left-paren	'(' HEX:0x28, DEC: 40
right-paren	')' HEX:0x29, DEC: 41

IMPORTANT NOTE: *only* the upper-level-UTF-8/UniCode versions of these punctuation marks will produce a non-zero result. An actual ASCII comma, semi-colon, quote, bracket, or parenthesis (etc...) will cause this method to return ASCII-0. Please use endOfPhrase(char) to include the lower-level (Already down-converted ASCII) with non-zero results.

See Also:

endOfPhrase(char)

Code:

Exact Method Body:

 char auc = punctuationAUC(c);

 if (auc != 0) c = auc;

 // A 'switch' is used instead of an 'if' with a char-cast because it is easier to
 // read on this page.  Only the characters having ASCII 59, 44, 58, 34, etc... should
 // return non-zero values.
 switch ((int) auc)
 {
     // These characters constitute an "End of Phrase" marker
     case 0x3B: return ';';	// DEC: 59
     case 0x2C: return ',';	// DEC: 44
     case 0x3A: return ':';	// DEC: 58
     case 0x22: return '\"';	// DEC: 34
     case 0x27: return '\'';	// DEC: 39
     case 0x5B: return '[';	// DEC: 91
     case 0x5D: return ']';	// DEC: 93
     case 0x3C: return '<';	// DEC: 60
     case 0x3E: return '>';	// DEC: 62
     case 0x28: return '(';	// DEC: 40
     case 0x29: return ')';	// DEC: 41

     // All other results should return '0'
     default: return 0;
 }

endOfPhrase


public static char endOfPhrase(char c)

endOfPhrase - any version of the end-of-phrase markers usually used in Mandarin Chinese text. This method returns the exact same results as the endOfPhraseAUC(char) method.

EXCEPT: The regular/normal version of that punctuation mark (ASCII for semi-colon, comma, quote, etc...) will return the exact-same semi-colon, comma or quote - instead of ASCII-0

Input & Method Called:	Result
endOfPhrase(';')	';' // Normal ASCII semi-colon symbol
endOfPhraseAUC(';')	0 // ASCII-0 returned
endOfPhrase('】')	']' // right-bracket returned
endOfPhraseAUC('】')	']' // right-bracket returned
endOfPhrase(']')	']' // right-bracket returned
endOfPhraseAUC(']')	0 // ASCII-0 returned

The list of end-of-phrase characters includes the following:
';' ',' ':' '\"' '\'' '[' ']' '<' '>' '(' ')'

Parameters:

c - Any character in the entire UniCode range. 0x0000 to 0xFFFF

Returns:

If 'c' is an "AUC" version of an end-of-phrase marker - or a regular lower-level ASCII version - then that punctuation mark is returned. Otherwise 0 is returned.

See Also:

punctuationAUC(char)

Code:

Exact Method Body:

 char auc = punctuationAUC(c);

 if (auc != 0) c = auc;

 if ((c == ';')  ||  (c == ',')  || (c == ':') ||
     (c == '\"') ||  (c == '\'') ||
     (c == '[')  ||  (c == ']')  || 
     (c == '<')  ||  (c == '>')  ||
     (c == '(')  ||  (c == ')'))
     return c;

 return (char) 0;
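
The difference between the two methods can be sketched in a few lines. This is only an illustration: the class name PhraseMarkerSketch, the helper isPhraseMarker, and the three-entry stand-in for punctuationAUC(char) are hypothetical, and the full-width semi-colon mapping is an assumption. The real punctuationAUC(char) handles many more characters.

```java
// Sketch only: a hypothetical stand-in, not the ZH source.
class PhraseMarkerSketch
{
    // Stand-in for ZH.punctuationAUC: returns the ASCII equivalent of a few
    // alternate-UniCode marks, or 0 when 'c' is not one of them.
    static char punctuationAUC(char c)
    {
        switch (c)
        {
            case '\uFF1B': return ';';   // full-width semi-colon (assumed mapping)
            case '\uFF0C': return ',';   // full-width comma
            case '\u3011': return ']';   // right black lenticular bracket
            default:       return 0;
        }
    }

    // True for the eleven end-of-phrase characters listed above
    static boolean isPhraseMarker(char c)
    { return ";,:\"'[]<>()".indexOf(c) >= 0; }

    // AUC-only: a plain ASCII input yields 0, because punctuationAUC returns 0 for it
    static char endOfPhraseAUC(char c)
    {
        char auc = punctuationAUC(c);
        return isPhraseMarker(auc) ? auc : (char) 0;
    }

    // AUC or ASCII: the down-converted character replaces 'c' before the check
    static char endOfPhrase(char c)
    {
        char auc = punctuationAUC(c);
        if (auc != 0) c = auc;
        return isPhraseMarker(c) ? c : (char) 0;
    }

    public static void main(String[] args)
    {
        System.out.println((int) endOfPhrase(';'));      // the ASCII code of ';'
        System.out.println((int) endOfPhraseAUC(';'));   // 0
    }
}
```

Both variants normalize first and filter second; they differ only in whether the filter sees the original character when no conversion took place.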

quoteAUC

```
public static char quoteAUC(char c)
```
Quotes - any version. AUC or normal-ASCII, (BOTH) single or double quote.
Parameters:

c - Any character in the entire UniCode range. 0x0000 to 0xFFFF which is the Basic Multi Lingual Plane.

Returns:

If the input character 'c' is an "AUC" version of the single (or double) quote, or the regular-ASCII single/double quote, then the appropriate single or double-quote is returned. Otherwise 0 is returned.

See Also:

punctuationAUC(char)

Code:
Exact Method Body:

 char auc = punctuationAUC(c);

 if (auc != 0) c = auc;

 switch ((int) c)
 {
     case 0x22:  return '\"';	// DEC: 34
     case 0x27:  return '\'';	// DEC: 39
     default:    return (char) 0;
 }

commaAUC

```
public static char commaAUC(char c)
```
Comma - any version. AUC or normal-ASCII, (BOTH) comma
Parameters:

c - Any character in the entire UniCode range. 0x0000 to 0xFFFF, the Basic Multi-Lingual Plane.

Returns:

If the input character 'c' is an "AUC" version of the comma, or the regular-ASCII comma, then the comma is returned. Otherwise 0 is returned.

See Also:

punctuationAUC(char)

Code:
Exact Method Body:

 char auc = punctuationAUC(c);

 if (auc != 0) c = auc;

 switch ((int) c)
 {
     case 0x2c:  return ',';	// DEC: 44
     default:    return (char) 0;
 }

bracketAUC


public static char bracketAUC(char c)

Brackets - any version. AUC or normal-ASCII, (BOTH) brackets

Parameters:

c - Any character in the entire UniCode range. 0x0000 to 0xFFFF

Returns:

If the input character 'c' is an "AUC" version of a bracket, or a regular-ASCII bracket, then the appropriate bracket is returned. Otherwise 0 is returned.

See Also:

punctuationAUC(char)

Code:

Exact Method Body:

 char auc = punctuationAUC(c);

 if (auc != 0) c = auc;

 switch ((int) c)
 {
     case 0x5B:  return '[';	// DEC: 91
     case 0x5D:  return ']';	// DEC: 93
     case 0x3C:  return '<';	// DEC: 60
     case 0x3E:  return '>';	// DEC: 62
     default:    return (char) 0;
 }

parenAUC

```
public static char parenAUC(char c)
```
Parentheses - any version. AUC or normal-ASCII, (BOTH) parentheses
Parameters:

c - Any character in the entire UniCode range. 0x0000 to 0xFFFF

Returns:

If the input character 'c' is an "AUC" version of a parenthesis, or a regular-ASCII parenthesis, then the appropriate parenthesis is returned. Otherwise 0 is returned.

See Also:

punctuationAUC(char)

Code:
Exact Method Body:

 char auc = punctuationAUC(c);

 if (auc != 0) c = auc;

 switch ((int) c)
 {
     case 0x28:  return '(';	// DEC: 40
     case 0x29:  return ')';	// DEC: 41
     default:    return (char) 0;
 }

testAUC


public static java.lang.String testAUC()

Returns:

An HTML <TABLE> that contains many tests of the subroutines in this class

Code:

Exact Method Body:

 StringBuilder ret = new StringBuilder();
 ret.append( "<TABLE BORDER=\"1\"><TR>"      +
             "<TD WIDTH=\"30\">&nbsp;</TD>"  +
             "<TD WIDTH=\"70\">&nbsp;</TD>"  +
             "<TD WIDTH=\"70\">&nbsp;</TD>"  +
             "<TD WIDTH=\"30\">&nbsp;</TD>"  );

 for (int i=4; i < 12; i++)
     ret.append("<TD WIDTH=\"70\">&nbsp;</TD>");
 ret.append("</TR>");

 for (int i=0; i < AUC.length(); i++)
 {
     char c = AUC.charAt(i);

     if (c == ' ') continue;

     // Check original character (not punctuation-converted cc)
     char    bl          = Integer.toString(bulletListAUC(c)).charAt(0);
     boolean bpmf        = isBPMFAUC(c);

     // first, convert the punctuation to normal-ASCII punctuation
     // These are the "translated" characters
     // The "translated character" is where, for example '〗' ==> ']'
     char	newC       = punctuationAUC(c);

     // These are used for building <TABLE> & <TD> entry strings
     char    q           = quoteAUC(newC);
     char    es          = endOfSentenceAUC(newC);
     char    ep          = endOfPhraseAUC(newC);
     char    com         = commaAUC(newC);
     char    br          = bracketAUC(newC);
     char    p           = parenAUC(newC);

     char    ascii       = punctuationAUC(c);
     if (ascii   == 0)   ascii = alphaNumericAUC(c);
     if (bl      != 0)   ascii = bl;
     if (bpmf)           ascii = c;
     if (ascii   == 0)   ascii = 'x';

     // =================================================
     // This is for debugging this test function
     String	tmp =   " newCC = " + newC  + ", q="    + q     +
                     ", es="     + es    + ", ep="   + ep    +
                     ", com="    + com   + ", br="   + br    +
                     ", p="      + p	    + ", bl ="  + bl    +
                     ", bpmf="   + bpmf;

     tmp = tmp.replaceAll("<", "&lt;").replaceAll(">", "&gt;");

     // Build the HTML Table 
     ret.append("<TR>");

     ret.append("<TD>" + c + "</TD>");
     ret.append("<TD>" + ((int) c) + "</TD>");
     ret.append("<TD>" + "0x" + String.format("%x",(int) c).toUpperCase() + "</TD>");
     ret.append("<TD>" + ascii + "</TD>");

     ret.append("<TD>" + ((q		== 0)	? "" : "Quote")		+ "</TD>");	
     ret.append("<TD>" + ((es	== 0)	? "" : "Sentence")	+ "</TD>");
     ret.append("<TD>" + ((ep	== 0)	? "" : "Phrase")	+ "</TD>");
     ret.append("<TD>" + ((com	== 0)	? "" : "Comma")		+ "</TD>");
     ret.append("<TD>" + ((br	== 0)	? "" : "Bracket")	+ "</TD>");
     ret.append("<TD>" + ((p		== 0)	? "" : "Paren")		+ "</TD>");
     ret.append("<TD>" + ((bl	== 0)	? "" : "Bullet")	+ "</TD>"); 
     ret.append("<TD>" + (bpmf ? "BPMF" : "") + "</TD>");

     // ==========================================================
     // Un-Comment this if you want to debug this print function
     // outStr += "</TR><TR><TD COLSPAN=\"12\">" + tmp + "</TD></TR>";

 }
 ret.append("</TABLE>");
 return ret.toString();

countLeadingLettersAndNumbers

```
public static int countLeadingLettersAndNumbers
            (java.lang.String chineseSentence)
```
Checks for any leading alphabetic ('a' ... 'z') and numeric ('0' ... '9') characters in a Chinese String.

CHANGED: 2018.09.24 - Commas and periods are left in the String (when situated between digits). These are considered to be part of the "Leading Letters and Numbers".
Parameters:

chineseSentence - A sentence that may or may not have leading letters & numbers.

Returns:

the String-index of the first non-alphabetic, non-numeric character in the String.

NOTE: white-space does not count, and the position of the first white-space character will be returned, if white-space is contained in this String.

See Also:

isAlphaNumeric(char)

Code:
Exact Method Body:

 for (int i = 0; i < chineseSentence.length(); i++)
 {
     char c = chineseSentence.charAt(i);

     if ((! isAlphaNumeric(c)) && (c != '.') && (c != ','))
         return i;
 }

 // This really ought not to happen, but just in case....
 return chineseSentence.length();
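
A minimal sketch of how the returned index behaves, assuming that isAlphaNumeric(char) accepts only ASCII letters and digits (the real method may accept more). The class name LeadingCountSketch is hypothetical. As in the method body above, any '.' or ',' inside the leading run is counted, whether or not digits surround it.

```java
// Sketch only: a hypothetical re-statement of the loop, not the ZH source.
class LeadingCountSketch
{
    // Assumption: only ASCII letters and digits count as alpha-numeric
    static boolean isAlphaNumeric(char c)
    {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
    }

    // Returns the index of the first character that is not a letter, digit, '.' or ','
    static int countLeadingLettersAndNumbers(String s)
    {
        for (int i = 0; i < s.length(); i++)
        {
            char c = s.charAt(i);
            if (!isAlphaNumeric(c) && c != '.' && c != ',') return i;
        }

        // The whole String consisted of leading letters and numbers
        return s.length();
    }

    public static void main(String[] args)
    {
        // "5.0" followed by one Chinese character (escaped to keep this source ASCII)
        System.out.println(countLeadingLettersAndNumbers("5.0\u7248"));
    }
}
```

Because the return value is an index, it can be passed directly to String.substring to split the leading run from the rest of the sentence.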

convertAnyAUC

```
public static java.lang.String convertAnyAUC(java.lang.String s)
```
Checks for higher-Unicode letters and numbers, and converts them into lower-level versions of the appropriate letter or number.

SPECIFICALLY: This method is just a "for-loop" which makes a call to alphaNumericAUC(char) and if zero is not returned from that method-call, then the character at that index (a higher UniCode letter or number) is replaced by the returned character.
Parameters:

s - This may or may not have "Alternate UniCode" Characters for letters and numbers.

Returns:

A copy of 's' in which any "alternate" versions of 'A' ... 'Z' or '0' ... '9' have been replaced by the regular letter or number. All other characters are left unchanged.

See Also:

alphaNumericAUC(char)

Code:
Exact Method Body:

 char[] cArr = s.toCharArray();

 for (int i = 0; i < cArr.length; i++)
 {
     char auc = alphaNumericAUC(cArr[i]);

     if (auc != 0) cArr[i] = auc;
 }

 return new String(cArr);
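
To illustrate the kind of conversion involved, here is a sketch that handles only the UniCode "Fullwidth Forms" letters and digits, which sit exactly 0xFEE0 above their ASCII counterparts. The class FullWidthSketch and its narrow alphaNumericAUC are assumptions made for this example; the characters actually recognized by ZH.alphaNumericAUC(char) are defined by the AUC field and may differ.

```java
// Sketch only: covers the Fullwidth Forms block and nothing else.
class FullWidthSketch
{
    // Returns the ASCII letter or digit for a full-width one, or 0 otherwise
    static char alphaNumericAUC(char c)
    {
        boolean digit = (c >= '\uFF10' && c <= '\uFF19');
        boolean upper = (c >= '\uFF21' && c <= '\uFF3A');
        boolean lower = (c >= '\uFF41' && c <= '\uFF5A');

        return (digit || upper || lower) ? (char) (c - 0xFEE0) : (char) 0;
    }

    // Same loop as the method body above: replace in a char[] copy, then rebuild
    static String convertAnyAUC(String s)
    {
        char[] cArr = s.toCharArray();

        for (int i = 0; i < cArr.length; i++)
        {
            char auc = alphaNumericAUC(cArr[i]);
            if (auc != 0) cArr[i] = auc;
        }

        return new String(cArr);
    }

    public static void main(String[] args)
    {
        // Full-width 'A', 'B', 'C', '1', '2', '3' (escaped to keep this source ASCII)
        System.out.println(convertAnyAUC("\uFF21\uFF22\uFF23\uFF11\uFF12\uFF13"));
    }
}
```

Since every replacement is one char for one char, the returned String always has the same length as the input.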

countSyllablesAndNonChinese


public static int countSyllablesAndNonChinese(java.lang.String word,
                                              java.lang.Appendable DOUT)
                                       throws java.io.IOException

Counts syllables in a "word" of PinYin. The input String is expected to not have any spaces!

NOTE: The number of syllables in a Chinese PinYin "word" identifies the number of Chinese Characters that were used to generate the input PinYin String.

CHANGED: 2018.09.24 - Added a test for periods and commas that are situated directly between two digits. In the String "5.0" the period between 5 and 0 is no longer removed!

If the String "5.0" were passed as the "word" parameter, the result should be 3!

Parameters:

word - A word in the "PinYin" format. (罗马拼音)

DOUT - Receives the diagnostic messages that are written while the syllables are counted. This must implement java.lang.Appendable

Returns:

the number of syllables (specifically: Chinese Characters) in the input word.

Throws:

java.io.IOException - The interface java.lang.Appendable mandates that the IOException must be treated as a checked exception for all output operations. Therefore IOException is a required exception in this method's throws clause.

Code:

Exact Method Body:

 int numChinese	= 0;

 // Tone-Vowels & Numbers always correspond to a character
 for (int letter = 0; letter < word.length(); letter++)
 {
     char c = word.charAt(letter);
     if (    ZH.isToneVowel(c)   ||
             ZH.isNumber(c)      ||
             (c == '.')          ||
             (c == ',')
         )
         numChinese++;
 }

 // Checks for vowel-strings that don't contain a tone
 // ==> Checks for "clear tone"
 String copyW = "" + word;

 DOUT.append("[" + copyW + "] - ");

 for (int letterIndex = 0; letterIndex < copyW.length(); letterIndex++)
     if (    ! ZH.isRegVowel(copyW.charAt(letterIndex))      &&
             ! ZH.isToneVowel(copyW.charAt(letterIndex))	)
         copyW =	StringParse.setChar(copyW, letterIndex, ' ');
            
 DOUT.append("after erasing non-vowels [" + copyW + "]\n");
         
 String[] syllables = copyW.trim().split(" ");

 DOUT.append("Syllables are:");
 for (int sylIndex = 0; sylIndex < syllables.length; sylIndex++	)
     DOUT.append("[" + syllables[sylIndex] + "]");
 DOUT.append("\n");

 TOP:
 for (int sylIndex = 0; sylIndex < syllables.length; sylIndex++)
 {
     String	syllable    = syllables[sylIndex].trim();
     boolean	foundTone   = false;

     // The split(' ') function sometimes provides blanks
     if (syllable.length() == 0) continue TOP;

     for (int vowelIndex = 0; vowelIndex < syllable.length(); vowelIndex++)
         if (ZH.isToneVowel(syllable.charAt(vowelIndex)))
             continue TOP;

     numChinese++;
     DOUT.append("NOTE: *** FOUND CLEAR TONE\n");
 }

 return numChinese;
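
The counting rule above can be condensed into a short sketch. The class SyllableSketch, its two vowel tables, and the omission of the DOUT logging are all assumptions made for illustration; ZH.isToneVowel(char) and ZH.isRegVowel(char) may recognize a different set of characters (upper-case vowels, for instance, are ignored here).

```java
// Sketch only: hypothetical vowel tables, no diagnostic output.
class SyllableSketch
{
    // Vowels carrying tone marks 1 to 4, escaped to keep this source ASCII
    static final String TONE_VOWELS =
        "\u0101\u00E1\u01CE\u00E0" +   // a
        "\u0113\u00E9\u011B\u00E8" +   // e
        "\u012B\u00ED\u01D0\u00EC" +   // i
        "\u014D\u00F3\u01D2\u00F2" +   // o
        "\u016B\u00FA\u01D4\u00F9" +   // u
        "\u01D6\u01D8\u01DA\u01DC";    // u with diaeresis

    static final String REG_VOWELS = "aeiou\u00FC";

    static int countSyllables(String word)
    {
        int count = 0;
        char[] vowelsOnly = word.toCharArray();

        for (int i = 0; i < vowelsOnly.length; i++)
        {
            char c = vowelsOnly[i];
            boolean tone = TONE_VOWELS.indexOf(c) >= 0;

            // Tone-vowels, digits, periods and commas each count once
            if (tone || (c >= '0' && c <= '9') || c == '.' || c == ',') count++;

            // Blank out everything that is not a vowel
            if (!tone && REG_VOWELS.indexOf(c) < 0) vowelsOnly[i] = ' ';
        }

        // A remaining vowel-run without any tone mark is a "clear tone" syllable
        for (String run : new String(vowelsOnly).trim().split(" "))
        {
            if (run.isEmpty()) continue;   // split(" ") yields blanks between runs

            boolean hasTone = false;
            for (char c : run.toCharArray())
                if (TONE_VOWELS.indexOf(c) >= 0) hasTone = true;

            if (!hasTone) count++;
        }

        return count;
    }

    public static void main(String[] args)
    {
        System.out.println(countSyllables("5.0"));   // prints 3
    }
}
```

The first pass counts everything that is unambiguous, and the second pass only adds syllables whose vowels carry no tone mark, so no syllable is counted twice.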

delAllPunctuationCHINESE


public static java.lang.String delAllPunctuationCHINESE
            (java.lang.String s)

Deletes all punctuation & non-character symbols. The String that is returned will be shortened by precisely the number of punctuation characters that were contained in that String.

NOTE: '.' and ',' (periods and commas) between number/digits are not removed!

Parameters:

s - An input String (in Mandarin - 普通话)

Returns:

a String that is the same as the input String - after skipping characters as follows:

 if (isChinese(c) || isAlphaNumeric(c) || (alphaNumericAUC(c) != 0)) continue;
 (else) s = StringParse.delChar(s, chr--);

Code:

Exact Method Body:

 char[]  cArr        = s.toCharArray();
 int     sourcePos   = 0;
 int     destPos     = 0;

 while (sourcePos < cArr.length)
 {
     char c = cArr[sourcePos];

     // Check for things like 5.0 or 1,120,987 - SPECIFICALLY Comma's and Period's situated
     // directly between 2 numbers.

     if (    ((c == '.') || (c == ','))
         &&  (((sourcePos-1) == -1)          || isNumber(cArr[sourcePos-1]))
         &&  (((sourcePos+1) == s.length())  || isNumber(cArr[sourcePos+1]))
     )
         { cArr[destPos++] = cArr[sourcePos++]; continue; }

     // AUC were converted before calling this function ... (alphaNumericAUC(c) != 0)) 

     if (isChinese(c) || isAlphaNumeric(c))
         { cArr[destPos++] = cArr[sourcePos++]; continue; }

     sourcePos++;
 }

 // Only the first 'destPos' characters of the array were kept
 return new String(cArr, 0, destPos);
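
The two-index compaction can be sketched as a stand-alone class. The names below are hypothetical, isChinese is assumed to mean "belongs to the UniCode Han script", and isAlphaNumeric is assumed to cover ASCII letters and digits only; the real ZH predicates may be broader. The sketch builds its result from the first 'dest' characters of the array, so the returned String is shorter by the number of characters dropped.

```java
// Sketch only: simplified predicates, not the ZH source.
class CompactSketch
{
    static boolean isNumber(char c) { return c >= '0' && c <= '9'; }

    // Assumption: "Chinese" means the UniCode Han script
    static boolean isChinese(char c)
    { return Character.UnicodeScript.of(c) == Character.UnicodeScript.HAN; }

    // Assumption: ASCII letters and digits only
    static boolean isAlphaNumeric(char c)
    { return isNumber(c) || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'); }

    static String delAllPunctuation(String s)
    {
        char[] cArr = s.toCharArray();
        int dest = 0;

        for (int src = 0; src < cArr.length; src++)
        {
            char c = cArr[src];

            // Keep '.' and ',' when each neighbor is a digit (or the String boundary)
            boolean separator = (c == '.' || c == ',')
                && (src == 0               || isNumber(cArr[src - 1]))
                && (src + 1 == cArr.length || isNumber(cArr[src + 1]));

            // dest never exceeds src, so unread characters are never overwritten
            if (separator || isChinese(c) || isAlphaNumeric(c)) cArr[dest++] = c;
        }

        // Only the first 'dest' slots hold kept characters
        return new String(cArr, 0, dest);
    }

    public static void main(String[] args)
    {
        System.out.println(delAllPunctuation("1,120 (approx.)"));   // prints 1,120approx
    }
}
```

Compacting inside a single char[] avoids building a new String for every deleted character.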

delAllPunctuationPINYIN


public static java.lang.String delAllPunctuationPINYIN(java.lang.String s)

Deletes all punctuation & non-character symbols from a String of PinYin. The returned String will have the same length as it originally did, but the locations where punctuation existed will have been replaced with a space character.

NOTE: '.' and ',' (periods and commas) between number/digits are not removed!

Parameters:

s - An input String in 罗马拼音

Returns:

A String that is the same as the input String - after skipping characters as follows:

 if (isAlphaNumeric(c) || (alphaNumericAUC(c) != 0)) continue;
 (else) s = StringParse.setChar(s, chr, ' ');

Code:

Exact Method Body:

 char[] cArr = s.toCharArray();

 // This loop converts all non-AlphaNumeric unicode to a space
 for (int i = 0; i < cArr.length; i++)
 {
     char c = cArr[i];

     if (isAlphaNumeric(c) || (alphaNumericAUC(c) != 0)) continue;

     // Check for things like 5.0 or 1,120,987 - SPECIFICALLY Comma's and Period's
     // situated directly between 2 numbers.

     if (    ((c == '.') || (c == ','))
         &&  (((i-1) == -1)          || isNumber(cArr[i-1]))
         &&  (((i+1) == s.length())  || isNumber(cArr[i+1]))
     )
         continue;

     cArr[i] = ' ';
 }

 return new String(cArr);
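
For contrast with delAllPunctuationCHINESE, a small sketch of the length-preserving variant. The class PinYinBlankSketch is hypothetical, and java.lang.Character.isLetterOrDigit is used here as an assumed stand-in for the isAlphaNumeric / alphaNumericAUC test, so that tone-marked vowels survive.

```java
// Sketch only: Character.isLetterOrDigit is an assumed substitute for the ZH predicates.
class PinYinBlankSketch
{
    static boolean isNumber(char c) { return c >= '0' && c <= '9'; }

    static String blankPunctuation(String s)
    {
        char[] cArr = s.toCharArray();

        for (int i = 0; i < cArr.length; i++)
        {
            char c = cArr[i];

            if (Character.isLetterOrDigit(c)) continue;

            // Keep '.' and ',' when each neighbor is a digit (or the String boundary)
            boolean separator = (c == '.' || c == ',')
                && (i == 0               || isNumber(cArr[i - 1]))
                && (i + 1 == cArr.length || isNumber(cArr[i + 1]));

            // Everything else is overwritten in place, so the length never changes
            if (!separator) cArr[i] = ' ';
        }

        return new String(cArr);
    }

    public static void main(String[] args)
    {
        System.out.println("[" + blankPunctuation("hao,5.0!") + "]");   // prints [hao 5.0 ]
    }
}
```

Keeping the length fixed means that an index into the cleaned String still refers to the same position in the original.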

GTPPEIndexOf


public static int GTPPEIndexOf(java.lang.String s,
                               char c)

GTPPE: Google Translate Punctuation Pronunciation Equivalent. This searches through a String to find the location of the "equivalent punctuation mark".

Parameters:

s - The input String, expected to be the result of a GCS TS query. This function is totally useless for any Pronunciation String that hasn't been obtained from GCS TS.

NOTE: The input String is intended to be in "PinYin" (罗马拼音)

c - The original punctuation character to look for... Generally, this is used to search for higher-level UTF-8 chars that have been "down-converted" by GCS TS

Returns:

the indexOf() of the character in the original input String. The actual character is not looked for, BUT RATHER, the Google Cloud Server Translation Services equivalent character. Specifically, GCSTS has a "substitute punctuation" for many higher-level UTF-8 and UniCode chars. There are 5 different versions of a quote...

Code:

Exact Method Body:

 int cc = (int) c;

 // if (c == '∶')	return s.indexOf(c);
 if (cc == 0x2236)	return s.indexOf(c);
 // if (c == '：')	return s.indexOf(':');
 if (cc == 0xFF1A)	return s.indexOf(':');	// (0x003A);
 // if (c == ':')	return s.indexOf(c);	// Natural colon
 if (cc == 0x003A)	return s.indexOf(c);

 // commas
 // if (c == '、')	return s.indexOf(',');
 if (cc == 0x3001)	return s.indexOf(',');	// (0x002C);
 // if (c == '，')	return s.indexOf(',');
 if (cc == 0xFF0C)	return s.indexOf(',');	// (0x002C);
 // if (c == ',')	return s.indexOf(c);	// natural comma
 if (cc == 0x002C)	return s.indexOf(c);

 // periods
 // if (c == '。')	return s.indexOf('.');
 if (cc == 0x3002)	return s.indexOf('.');	// (0x002E);
 // if (c == '○')	return s.indexOf(c);
 if (cc == 0x25CB)	return s.indexOf(c);
 // if (c == '●')	return s.indexOf(c);
 if (cc == 0x25CF)	return s.indexOf(c);
 // if (c == '．')	return s.indexOf('.');
 if (cc == 0xFF0E)	return s.indexOf('.');	// (0x002E);
 // if (c == '.')	return s.indexOf(c);	// natural period
 if (cc == 0x002E)	return s.indexOf(c);


 // Exclamation & Question
 // if (c == '?')	return s.indexOf(c);	// natural question-mark
 if (cc == 0x003F)	return s.indexOf(c);
 // if (c == '？')	return s.indexOf('?');
 if (cc == 0xFF1F)	return s.indexOf('?');	// (0x003F);
 // if (c == '！')	return s.indexOf('!');
 if (cc == 0xFF01)	return s.indexOf('!');	// (0x0021);
 // if (c == '!')	return s.indexOf(c);	// natural exclamation
 if (cc == 0x0021)	return s.indexOf(c);

 // single-quotes
 // if (c == '‘')	return s.indexOf(c);
 if (cc == 0x2018)	return s.indexOf(c);
 // if (c == '’')	return s.indexOf(c);
 if (cc == 0x2019)	return s.indexOf(c);
 // if (c == '′')	return s.indexOf(c);
 if (cc == 0x2032)	return s.indexOf(c);
 // if (c == '＇')	return s.indexOf('\'');
 if (cc == 0xFF07)	return s.indexOf('\'');	// (0x0027);
 // if (c == '｀')	return s.indexOf('`');
 if (cc == 0xFF40)	return s.indexOf('`');	// (0x0060);
 // if (c == '\'')	return s.indexOf(c);	// natural single-quotes
 if (cc == 0x0027)	return s.indexOf(c);
 

 // NOT DETECTED RIGHT NOW.. 
 // if (c == '《')	return s.indexOf('“');
 if (cc == 0x300A)	return s.indexOf(CONSTSpecialQuoteLeft);
 // if (c == '》')	return s.indexOf('”');
 if (cc == 0x300B)	return s.indexOf(CONSTSpecialQuoteRight);

 // double-quotes
 // if (c == '¨')	return s.indexOf(c);
 if (cc == 0x00A8)	return s.indexOf(c);
 // if (c == '〃')	return s.indexOf(c);
 if (cc == 0x3003)	return s.indexOf(c);
 // if (c == '“')	return s.indexOf(c);
 if (cc == 0x201C)	return s.indexOf(c);
 // if (c == '”')	return s.indexOf(c);
 if (cc == 0x201D)	return s.indexOf(c);
 // if (c == '″')	return s.indexOf(c);
 if (cc == 0x2033)	return s.indexOf(c);
 // if (c == '＂')	return s.indexOf('\"');
 if (cc == 0xFF02)	return s.indexOf('\"');	// (0x0022);
 // if (c == '\"')	return s.indexOf(c);	// natural double quotes
 if (cc == 0x0022)	return s.indexOf(c);


 // Brackets
 // if (c == '[')	return s.indexOf(c);
 if (cc == 0x005B)	return s.indexOf(c);
 // if (c == ']')	return s.indexOf(c);
 if (cc == 0x005D)	return s.indexOf(c);
 // if (c == '［')	return s.indexOf('[');
 if (cc == 0xFF3B)	return s.indexOf('[');	// (0x005B);
 // if (c == '］')	return s.indexOf(']');
 if (cc == 0xFF3D)	return s.indexOf(']');	// (0x005D);
 // if (c == '【')	return s.indexOf('[');
 if (cc == 0x3010)	return s.indexOf('[');	// (0x005B);
 // if (c == '】')	return s.indexOf(']');
 if (cc == 0x3011)	return s.indexOf(']');	// (0x005D);
 // if (c == '〖')	return s.indexOf(c);
 if (cc == 0x3016)	return s.indexOf(c);
 // if (c == '〗')	return s.indexOf(c);
 if (cc == 0x3017)	return s.indexOf(c);
 // if (c == '『')	return s.indexOf('“');
 if (cc == 0x300E)	return s.indexOf(CONSTSpecialQuoteLeft);
 // if (c == '』')	return s.indexOf('”');
 if (cc == 0x300F)	return s.indexOf(CONSTSpecialQuoteRight);
 // if (c == '「')	return s.indexOf('`');
 if (cc == 0x300C)	return s.indexOf('`');	// (0x0060);
 // if (c == '」')	return s.indexOf('\'');
 if (cc == 0x300D)	return s.indexOf('\'');	// (0x0027);


 // Parenthesis
 // if (c == '(')	return s.indexOf(c);
 if (cc == 0x0028)	return s.indexOf(c);
 // if (c == ')')	return s.indexOf(c);
 if (cc == 0x0029)	return s.indexOf(c);
 // if (c == '（')	return s.indexOf('(');
 if (cc == 0xFF08)	return s.indexOf('(');	// (0x0028);
 // if (c == '）')	return s.indexOf(')');
 if (cc == 0xFF09)	return s.indexOf(')');	// (0x0029);
 // if (c == '〔')	return s.indexOf(c);
 if (cc == 0x3014)	return s.indexOf(c);
 // if (c == '〕')	return s.indexOf(c);
 if (cc == 0x3015)	return s.indexOf(c);

 System.out.println("character not found: \'" + c + "\'\nZH.GTPPEIndexOf(String s, char c)");
 System.exit(0);
 return 0;
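
The same lookup can be pictured as a table rather than a chain of 'if' statements. The sketch below swaps in a java.util.HashMap for the if-chain and copies only a handful of the mappings listed above; the class EquivalentIndexSketch and the method indexOfEquivalent are hypothetical names. Unlike the method above, a character without an entry is simply searched for as-is, and the program is never terminated.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: a table-driven alternative covering a subset of the mappings.
class EquivalentIndexSketch
{
    // Original punctuation ==> the substitute expected in the PinYin String
    static final Map<Character, Character> EQUIVALENT = new HashMap<>();

    static
    {
        EQUIVALENT.put('\uFF1A', ':');   // full-width colon
        EQUIVALENT.put('\u3001', ',');   // ideographic comma
        EQUIVALENT.put('\uFF0C', ',');   // full-width comma
        EQUIVALENT.put('\u3002', '.');   // ideographic full stop
        EQUIVALENT.put('\uFF0E', '.');   // full-width full stop
        EQUIVALENT.put('\uFF1F', '?');   // full-width question mark
        EQUIVALENT.put('\uFF01', '!');   // full-width exclamation mark
        EQUIVALENT.put('\u3010', '[');   // left black lenticular bracket
        EQUIVALENT.put('\u3011', ']');   // right black lenticular bracket
        EQUIVALENT.put('\uFF08', '(');   // full-width left parenthesis
        EQUIVALENT.put('\uFF09', ')');   // full-width right parenthesis
    }

    // Returns the index of the substitute character, or -1 when it is absent
    static int indexOfEquivalent(String pinYin, char original)
    {
        Character substitute = EQUIVALENT.get(original);
        return pinYin.indexOf((substitute != null) ? substitute : original);
    }

    public static void main(String[] args)
    {
        // Looks for ',' because the full-width comma was down-converted
        System.out.println(indexOfEquivalent("ni hao, shijie.", '\uFF0C'));   // prints 6
    }
}
```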

Input Character	Output Character
。 ○ ● ．	'.' (normal period)
！	'!' (regular exclamation point)
？	'?' (usual question mark)
