Package Torello.Languages
Class ZH
- java.lang.Object
-
- Torello.Languages.ZH
-
public class ZH extends java.lang.Object
ZH (Mandarin Chinese): tools for parsing constructs from Mandarin News & other Web-Sites.
A series of simple Helper Routines for inspecting the special UTF-8
(non-Mandarin) characters often used in Mandarin HTML Web-Pages.
File Size: 58,676 Bytes Line Count: 1,469 '\n' Characters Found
Stateless Class: This class neither contains any program-state, nor can it be instantiated. The @StaticFunctional Annotation may also be called 'The Spaghetti Report'. Static-Functional classes are, essentially, C-Styled Files, without any constructors or non-static member fields. It is a concept very similar to the Java-Bean's @Stateless Annotation.
- 1 Constructor(s), 1 declared private, zero-argument constructor
- 35 Method(s), 35 declared static
- 7 Field(s), 7 declared static, 7 declared final
-
-
Field Summary
Fields
Modifier and Type    Field                     Description
static String        AUC                       The complete list of "higher-level" (alternate) Uni-Code chars.
static char          CONSTSpecialQuoteLeft     Special Quotation Mark, left-side
static char          CONSTSpecialQuoteRight    Special Quotation Mark, right-side
-
Method Summary
All Methods / Static Methods / Concrete Methods
Modifier and Type, Method, Description
static char alphaNumericAUC(char c)
    Alpha-Numeric character code from upper UniCode / UTF-8. These characters exist in UTF-8, but they ARE NOT the usual ASCII characters for the letters 'A' ... 'Z' or the numbers '0' ... '9'. They are, however, sometimes found in documents on Chinese News Websites, etc.
static char bracketAUC(char c)
    Brackets - any version.
static int bulletListAUC(char c)
    Bullet List characters in upper UniCode / UTF-8.
static char commaAUC(char c)
    Comma - any version.
static String convertAnyAUC(String s)
    Checks for higher-Unicode letters and numbers, and converts them into lower-level versions of the appropriate letter or number.
static int countLeadingLettersAndNumbers(String chineseSentence)
    Checks for any leading alphabetic ('a' ... 'z') and numeric ('0' ... '9') characters in a Chinese String.
static int countSyllablesAndNonChinese(String word, Appendable DOUT)
    Counts syllables in a "word" of PinYin.
static int countToneVowels(String pinYinStr)
    Counts the number of tone vowels in a PinYin String.
static String delAllPunctuationCHINESE(String s)
    Deletes all punctuation & non-character symbols.
static String delAllPunctuationPINYIN(String s)
    Deletes all punctuation & non-character symbols from a String of PinYin.
static char endOfPhrase(char c)
    endOfPhrase - any version of the end-of-phrase markers usually used in Mandarin Chinese text.
static char endOfPhraseAUC(char c)
    Checks for end-of-phrase punctuation marks - and "down-converts" them to the simple ASCII equivalent version of that punctuation mark.
static char endOfSentence(char c)
    Checks for end-of-sentence punctuation marks.
static char endOfSentenceAUC(char c)
    Checks for end-of-sentence punctuation marks - and "down-converts" them to the simple ASCII equivalent version of that punctuation mark.
static String formatUTF8Chinese(char c)
    This is used to convert a Chinese Character into a full String that includes the UTF-8 code represented as a HEXADECIMAL number and a decimal number.
static int GTPPEIndexOf(String s, char c)
    GTPPE: Google Translate Punctuation Pronunciation Equivalent. This searches through a String to find the location of the "equivalent punctuation mark".
static String HTML2ChineseVowels(String s)
    Google Translate returns some text encoded as "&#num;" (the "ord(c)"). This is also called HTML Escaped Code, because instead of the actual ASCII/UTF8 characters themselves, their "Ord" values are returned, surrounded by the usual HTML Escape Character Sequence &#num;. This method does the chr(html-escape-code) and replaces the escape-sequence (which again is &#NUM;) with the actual character.
static String HTML2UTF8(String s)
    NOTE: This does the same as HTML2ChineseVowels(String), EXCEPT that it converts ANY HTML string that has been encoded as &#NUM;, not just the characters having accents and corresponding to Chinese Tone Vowels.
static boolean isAlpha(char c)
    Checks if a char is Alphabetic.
static boolean isAlphaNumeric(char c)
    Checks if a char is Alpha Numeric.
static boolean isBPMFAUC(char c)
    Bo Po Mo Fo (注音符號).
static boolean isChinese(char c)
    Helper function - checks if this is a character in the UTF-8 & ASCII ranges that contain Mandarin Chinese characters.
static boolean isNumber(char c)
    Regular Numbers Include: '0' ... '9'
static boolean isOther(char c)
    Checks that a char is neither Alpha Numeric nor White Space.
static boolean isRegLetter(char c)
    Regular Letters Include: 'A' ... 'Z' (65 - 90), 'a' ... 'z' (97 - 122)
static boolean isRegVowel(char c)
    Checks that a character is a standard vowel.
static boolean isSpace(char c)
    Checks for WhiteSpace: '\t', '\n', '\r', ' '
static boolean isToneVowel(char c)
    This is a helper function for the Mandarin Chinese accented vowel symbols in UTF-8, ASCII and UniCode.
static char parenAUC(char c)
    Parentheses - any version.
static char punctuationAUC(char c)
    This method, punctuationAUC(char), converts any characters which are common on many Mandarin Chinese websites into a lower-level, more typical/normal ASCII equivalent.
static char quoteAUC(char c)
    Quotes - any version.
static String testAUC()
static String toneVowelsToRegularVowels(String s)
    This performs a conversion of all vowels in a String from those with tones over them to the normal (un-accented) equivalent.
static char toneVowelToRegularVowel(char c)
    This makes the problems of dealing with the tone/accent marks above vowels in Chinese Pin-Yin easier.
-
-
-
Field Detail
-
AUC
public static final java.lang.String AUC
The complete list of "higher-level" (alternate) Uni-Code chars. Many of these are alternate punctuation marks used in documents that contain Mandarin Chinese.
- See Also:
- Constant Field Values
- Code:
- Exact Field Declaration Expression:
    public static final String AUC =
        // Special Punctuation characters found in Chinese HTML Pages
        "、 。 · ˉ ˇ ¨ 〃 々 — ~ ‖ … ‘ ’ " +
        "“ ” 〔 〕 〈 〉 《 》 「 」 『 』 〖 〗 【 】" +
        "± × ÷ ∶ ∧ ∨ ∑ ∏ ∪ ∩ ∈ ∷ √ ⊥ ∥ ∠" +
        "⌒ ⊙ ∫ ∮ ≡ ≌ ≈ ∽ ∝ ≠ ≮ ≯ ≤ ≥ ∞ ∵ " +
        "∴ ♂ ♀ ° ′ ″ ℃ $ ¤ ¢ £ ‰ § № ☆ ★" +
        "○ ● ◎ ◇ ◆ □ ■ △ ▲ ※ → ← ↑ ↓ 〓 " +
        "！ ＂ ＃ ￥ ％ ＆ ＇ （ ） ＊ ＋ ， － ． ／" +

        // Extra Alphabetic and Numeric Characters sometimes used
        // on web-pages written in Chinese
        "０ １ ２ ３ ４ ５ ６ ７ ８ ９ ： ； ＜ ＝ ＞ ？" +
        "＠ Ａ Ｂ Ｃ Ｄ Ｅ Ｆ Ｇ Ｈ Ｉ Ｊ Ｋ Ｌ Ｍ Ｎ Ｏ" +
        "Ｐ Ｑ Ｒ Ｓ Ｔ Ｕ Ｖ Ｗ Ｘ Ｙ Ｚ ［ ＼ ］ ＾ ＿" +
        "｀ ａ ｂ ｃ ｄ ｅ ｆ ｇ ｈ ｉ ｊ ｋ ｌ ｍ ｎ ｏ" +
        "ｐ ｑ ｒ ｓ ｔ ｕ ｖ ｗ ｘ ｙ ｚ ｛ ｜ ｝ ￣" +

        // Certain "Bullet List" / "Bullet Point" markers
        "⒈ ⒉ ⒊ ⒋ ⒌ ⒍ ⒎ ⒏ ⒐ ⒑ ⒒ ⒓ ⒔ ⒕ ⒖" +
        "⒗ ⒘ ⒙ ⒚ ⒛ ⑴ ⑵ ⑶ ⑷ ⑸ ⑹ ⑺ ⑻ ⑼ ⑽ ⑾" +
        "⑿ ⒀ ⒁ ⒂ ⒃ ⒄ ⒅ ⒆ ⒇ ① ② ③ ④ ⑤ ⑥ ⑦" +
        "⑧ ⑨ ⑩ ㈠ ㈡ ㈢ ㈣ ㈤ ㈥ ㈦ ㈧ ㈨ ㈩" +
        "Ⅰ Ⅱ Ⅲ Ⅳ Ⅴ Ⅵ Ⅶ Ⅷ Ⅸ Ⅹ Ⅺ Ⅻ" +

        // The "Bo Po Mo Fo" Pronunciation Used for Chinese Characters
        "ㄐ ㄑ ㄒ ㄓ ㄔ ㄕ ㄖ ㄗ ㄘ ㄙ ㄚ ㄛ ㄜ ㄝ ㄞ ㄟ" +
        "ㄠ ㄡ ㄢ ㄣ ㄤ ㄥ ㄦ ㄧ ㄨ ㄩ";
-
CONSTSpecialQuoteLeft
public static final char CONSTSpecialQuoteLeft
Special Quotation Mark, left-side- See Also:
- Constant Field Values
- Code:
- Exact Field Declaration Expression:
public static final char CONSTSpecialQuoteLeft = (char) 0x201C;
-
CONSTSpecialQuoteRight
public static final char CONSTSpecialQuoteRight
Special Quotation Mark, right-side- See Also:
- Constant Field Values
- Code:
- Exact Field Declaration Expression:
public static final char CONSTSpecialQuoteRight = (char) 0x201D;
-
-
Method Detail
-
toneVowelToRegularVowel
public static char toneVowelToRegularVowel(char c)
This makes the problems of dealing with the tone/accent marks above vowels in Chinese Pin-Yin easier. It converts a vowel with a tone mark over it into the regular vowel. This can be useful for certain String operations, although clearly the original meaning of the word would be lost.
- Parameters:
c - any character from ASCII / UTF-8 / UniCode Basic Multi Lingual Plane.
- Returns:
- if this is a UTF-8 character that is an accented vowel, the un-accented version of that vowel is returned. If this is not a PinYin symbol for a tone-vowel, ASCII 0 is returned.
- See Also:
toneVowelsToRegularVowels(String)
- Code:
- Exact Method Body:
    for (int i=0; i < CV.length; i++)
        if (CV[i] == c) return CV2RV[i];

    return (char) 0;
-
countToneVowels
public static int countToneVowels(java.lang.String pinYinStr)
Counts the number of tone vowels in a PinYin String.
- Parameters:
pinYinStr - A String, usually generated by Google Translate (and scraped from Google Translate), that contains PinYin.
- Returns:
- The number of Mandarin Chinese Pin-Yin "Tone Vowels"
- Code:
- Exact Method Body:
    int count=0;

    TOP:
    for (int i = pinYinStr.length()-1; i >= 0; i--)
        for (int j=0; j < CV.length; j++)
            if (pinYinStr.charAt(i) == CV[j])
            { count++; continue TOP; }

    return count;
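A minimal, self-contained sketch of the same counting idea. The class name ToneVowelCountSketch and the TONE_VOWELS constant are assumptions made for illustration; the library's private CV array is not shown on this page, so the list here is built only from the lower-case tone vowels listed under isToneVowel.

```java
// Hypothetical stand-in for the private CV array, which this page does not show.
final class ToneVowelCountSketch
{
    // Lower-case Pin-Yin tone vowels, written as escapes so file encoding does not matter.
    // a: 257 225 462 224   e: 275 233 283 232   i: 299 237 464 236
    // o: 333 243 466 242   u: 363 250 468 249   u-umlaut: 470 472 474 476
    private static final String TONE_VOWELS =
        "\u0101\u00E1\u01CE\u00E0" + "\u0113\u00E9\u011B\u00E8" +
        "\u012B\u00ED\u01D0\u00EC" + "\u014D\u00F3\u01D2\u00F2" +
        "\u016B\u00FA\u01D4\u00F9" + "\u01D6\u01D8\u01DA\u01DC";

    // Counts every char of the input that appears in TONE_VOWELS.
    static int countToneVowels(String pinYin)
    {
        int count = 0;

        for (int i = 0; i < pinYin.length(); i++)
            if (TONE_VOWELS.indexOf(pinYin.charAt(i)) >= 0) count++;

        return count;
    }

    public static void main(String[] args)
    {
        // The Pin-Yin for "ni hao", with third-tone marks on the i and the a
        System.out.println(countToneVowels("n\u01D0 h\u01CEo"));
    }
}
```

Using String.indexOf replaces the inner loop and the labeled continue; the result is the same for any list that contains no duplicates.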
-
toneVowelsToRegularVowels
public static java.lang.String toneVowelsToRegularVowels (java.lang.String s)
This performs a conversion of all vowels in a String from those with tones over them to the normal (un-accented) equivalent. It uses the single-character version of this method, toneVowelToRegularVowel(char).
- Parameters:
s - any java.lang.String containing Mandarin Romanizations.
- Returns:
- a String with all accented vowels converted to regular vowels.
- See Also:
toneVowelToRegularVowel(char)
- Code:
- Exact Method Body:
    int strlen = s.length();
    StringBuilder sb = new StringBuilder(s.length());
    char c;

    for (int i=0; i < strlen; i++)
        if ((c = toneVowelToRegularVowel(s.charAt(i))) != 0) sb.append(c);
        else sb.append(s.charAt(i));

    return sb.toString();
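For comparison, a similar result can be obtained with the JDK alone. The sketch below is not the library's method: it swaps in Unicode decomposition (java.text.Normalizer, form NFD) followed by removal of combining marks. The class and method names are hypothetical. Unlike the lookup-table approach above, it strips accents from every letter, not only Pin-Yin tone vowels, and it turns the u-umlaut tone vowels into a plain 'u'.

```java
import java.text.Normalizer;

// Hypothetical helper; an alternative technique, not the ZH implementation.
final class ToneStripSketch
{
    // NFD splits each accented letter into its base letter followed by combining marks;
    // the regular expression then deletes the combining marks (Unicode category Mn).
    static String stripToneMarks(String pinYin)
    {
        String decomposed = Normalizer.normalize(pinYin, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{Mn}+", "");
    }

    public static void main(String[] args)
    {
        // Pin-Yin for "ni hao" with tone marks, written with escapes
        System.out.println(stripToneMarks("n\u01D0 h\u01CEo"));
    }
}
```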
-
HTML2ChineseVowels
public static java.lang.String HTML2ChineseVowels(java.lang.String s)
Google Translate returns some text encoded as "&#num;" (the "ord(c)").
This is also called HTML Escaped Code, because instead of the actual ASCII/UTF8 characters themselves, their "Ord" values are returned, surrounded by the usual HTML Escape Character Sequence &#num;. This method does the chr(html-escape-code), where the code is the decimal number inside the escape, and replaces the escape-sequence (which again is &#NUM;) with the actual character.
NOTE: all of these are for "Chinese Tone Vowel" characters - The Google Translate module uses this method quite a bit. Here are a few examples of HTML-Escape-Sequences and the corresponding characters.
HTML-Escaped    ASCII/UTF-8 Character
&#192;          À
&#225;          á
&#283;          ě
&#363;          ū
&#474;          ǚ
... see array below for list
NOTE: HTML2UTF8(String) does the exact same thing, but does not limit the characters to be converted to only Chinese Tone Vowels. This method, HTML2ChineseVowels, only converts HTML-Escaped-Characters from this list:
private static final int[] H2CV = { 39, 192, 201, 224, 225, 232, 233, 236, 237, 242, 243,
249, 250, 252, 256, 257, 275, 283, 299, 333, 363, 462, 464, 466, 468, 474, 476 };
- See Also:
HTML2UTF8(String)
- Code:
- Exact Method Body:
    for (int i=0; i < H2CV.length; i++)
        s = s.replaceAll("&#" + H2CV[i] + ";", "" + (char) H2CV[i]);

    return s;
-
HTML2UTF8
public static java.lang.String HTML2UTF8(java.lang.String s)
NOTE: This does the same as HTML2ChineseVowels(String), EXCEPT that it converts ANY HTML string that has been encoded as &#NUM;, not just the characters having accents and corresponding to Chinese Tone Vowels.
- See Also:
HTML2ChineseVowels(String)
- Code:
- Exact Method Body:
    // Build the list of UTF8/ASCII character values (as Ord(c) / int) first.
    HashSet<Integer> utfList = new HashSet<Integer>();
    Matcher m = P1.matcher(s);
    while (m.find()) utfList.add(Integer.parseInt(m.group(1)));

    // Now convert them.
    for (Integer i : utfList)
        s = s.replaceAll("&#" + i.toString() + ";", "" + ((char) i.intValue()));

    return s;
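The decoding of decimal &#NUM; escapes can also be done in a single pass. The following is a rough sketch under stated assumptions, not the library's code: the private pattern P1 is not shown on this page, so the regular expression DECIMAL_ENTITY below is a guess at an equivalent, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical single-pass decoder for decimal HTML escapes such as &#464;
final class NumericEntitySketch
{
    // Assumed pattern: "&#", one or more decimal digits (captured), then ";"
    private static final Pattern DECIMAL_ENTITY = Pattern.compile("&#(\\d+);");

    // Replaces each decimal escape with the character it names.
    // Integer.parseInt and Character.toChars throw on values that are out of range.
    static String decode(String s)
    {
        Matcher      m  = DECIMAL_ENTITY.matcher(s);
        StringBuffer sb = new StringBuffer();

        while (m.find())
        {
            int    codePoint = Integer.parseInt(m.group(1));
            String ch        = new String(Character.toChars(codePoint));

            // quoteReplacement keeps '$' and '\' in the decoded text from being
            // treated as replacement-string syntax
            m.appendReplacement(sb, Matcher.quoteReplacement(ch));
        }

        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args)
    {
        System.out.println(decode("n&#464; h&#462;o"));
    }
}
```

Character.toChars is used instead of a (char) cast so that code points above 0xFFFF are emitted as surrogate pairs rather than truncated.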
-
formatUTF8Chinese
public static java.lang.String formatUTF8Chinese(char c)
This is used to convert a Chinese Character into a full String that includes the character's UniCode character code (the numeric value of the char) represented as a HEXADECIMAL number and a decimal number.
- Parameters:
c - any ASCII/UniCode/UTF-8 char - but, generally, expected to be a "Chinese Character."
NOTE: The choice for parameter char c has no actual constraints on its input value.
- Returns:
- A String of this format: 掭(0x63AD, 25517)
- Code:
- Exact Method Body:
    return c + "(0x" + String.format("%x", ((int) c)).toUpperCase() + ", " + ((int) c) + ")";
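A small runnable illustration of the format described above. The class name FormatCharSketch is hypothetical; the expression is the same one shown in the method body, and the sample character is the one from the Returns text, written as an escape.

```java
// Hypothetical wrapper around the expression shown in the method body above.
final class FormatCharSketch
{
    // Produces: the character, then "(0x" + upper-case hex + ", " + decimal + ")"
    static String format(char c)
    {
        return c + "(0x" + String.format("%x", (int) c).toUpperCase() + ", " + ((int) c) + ")";
    }

    public static void main(String[] args)
    {
        // 0x63AD is 25517 in decimal
        System.out.println(format('\u63AD'));
        System.out.println(format('A'));
    }
}
```

The format specifier "%X" would give upper-case hexadecimal digits directly, without the toUpperCase() call.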
-
isChinese
public static boolean isChinese(char c)
Helper function - checks if this is a character in the UTF-8 & ASCII ranges that contain Mandarin Chinese characters. This is not guaranteed to be accurate - some non-Chinese Japanese characters exist in this range. For the precise definition of what this function actually does, see the ranges printed in the method body below. The range 0x4E00 ... 0x9FFF is the UniCode "CJK Unified Ideographs" block.
COPIED FROM: *** http://www.khngai.com/chinese/charmap/tbluni.php?page=0
AND: ((c >= 0x4E00) && (c <= 0x9FFF))
COPIED FROM: *** http://www.khngai.com/chinese/charmap/tblgb.php?page=1
- Parameters:
c - any UTF-8, ASCII or UniCode character available.
- Returns:
TRUE if the input character 'c' is in the UTF-8/UniCode range for Chinese Characters
- Code:
- Exact Method Body:
    if ((c >= 0x4E00) && (c <= 0x9FFF)) return true;
    if ((c >= 0xB0A0) && (c <= 0xBFFF)) return true;
    if ((c >= 0xC0A0) && (c <= 0xCFFF)) return true;
    if ((c >= 0xD0A0) && (c <= 0xDFFF)) return true;
    if ((c >= 0xE0A0) && (c <= 0xEFFF)) return true;
    if ((c >= 0xF0A0) && (c <= 0xF7FF)) return true;
    return false;
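As a point of comparison, the JDK can classify a character by Unicode script. The sketch below is an alternative technique, not what isChinese does: it asks Character.UnicodeScript whether the char belongs to the HAN script. The class and method names are hypothetical. Han characters are shared by Chinese and Japanese text, so this does not separate the two languages either.

```java
// Hypothetical alternative to a hand-written range check.
final class HanCheckSketch
{
    // TRUE when the JDK's Unicode tables assign this char to the HAN script
    static boolean isHan(char c)
    {
        return Character.UnicodeScript.of(c) == Character.UnicodeScript.HAN;
    }

    public static void main(String[] args)
    {
        System.out.println(isHan('\u4E2D'));   // a CJK ideograph
        System.out.println(isHan('\u3042'));   // a Hiragana letter
        System.out.println(isHan('A'));        // a Latin letter
    }
}
```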
-
isOther
public static boolean isOther(char c)
Checks that a char is neither Alpha Numeric nor White Space.
- Parameters:
c - any UTF-8, ASCII or UniCode character available.
- Returns:
((!isAlphaNumeric(c)) && (!isSpace(c)));
- Code:
- Exact Method Body:
    return ((!isAlphaNumeric(c)) && (!isSpace(c)));
-
isAlphaNumeric
public static boolean isAlphaNumeric(char c)
Checks if a char is Alpha Numeric.
- Parameters:
c - any UTF-8, ASCII or UniCode character available.
- Returns:
(isAlpha(c) || isNumber(c));
- Code:
- Exact Method Body:
    return (isAlpha(c) || isNumber(c));
-
isAlpha
public static boolean isAlpha(char c)
Checks if a char is Alphabetic.
- Parameters:
c - any UTF-8, ASCII or UniCode character available.
- Returns:
(isToneVowel(c) || isRegVowel(c) || isRegLetter(c));
- Code:
- Exact Method Body:
    return (isToneVowel(c) || isRegVowel(c) || isRegLetter(c));
-
isToneVowel
public static boolean isToneVowel(char c)
This is a helper function for the Mandarin Chinese accented vowel symbols in UTF-8, ASCII and UniCode. The exact character code numbers are printed below.
NOTE: In 罗马拼音 (Pin-Yin Romanization), there are a few symbols that should never come up - at least as the software pertains to 罗马拼音-results provided by Google Cloud Server Translation API (GCS-TS/TAPI). This is because NO word in Pin-Yin ever starts with the letters I or U, or the U with an umlaut, so capitalized versions of these letters ought to never occur unless the entire PinYin were capitalized, which is something GCSTS never does.
- Parameters:
c - any UTF-8, ASCII or UniCode character available.
- Returns:
TRUE if the input character 'c' is one of the following:
Simple ASCII    UTF-8 Tone Vowel
a               ā (257), á (225), ǎ (462), à (224)
e               ē (275), é (233), ě (283), è (232)
i               ī (299), í (237), ǐ (464), ì (236)
o               ō (333), ó (243), ǒ (466), ò (242)
u               ū (363), ú (250), ǔ (468), ù (249)
u               ǖ (470), ǘ (472), ǚ (474), ǜ (476)
A               Ā (256), Á (193), Ǎ (461), À (192)
E               Ē (274), É (201), Ě (282), È (200)
O               Ō (332), Ó (211), Ǒ (465), Ò (210)
In Mandarin Chinese, PinYin-words cannot start with these letters below. Therefore it would be highly unlikely to see a "capitalized" version of these tone-vowels.
Simple ASCII    UTF-8 Tone Vowel
I               Ī (298), Í (205), (there are 2: Ǐ (463), Ĭ (300)), Ì (204)
U               Ū (362), Ú (218), Ŭ (364), Ù (217)
U               (Ü (220) - no tone): Ǖ (469), Ǘ (471), Ǚ (473), Ǜ (475)
- Code:
- Exact Method Body:
    // A, ā 257, á 225, ǎ 462, à 224
    if ((c == 257) || (c == 225) || (c == 462) || (c == 224)) return true;

    // E, ē 275, é 233, ě 283, è 232
    if ((c == 275) || (c == 233) || (c == 283) || (c == 232)) return true;

    // I, ī 299, í 237, ǐ 464, ì 236
    if ((c == 299) || (c == 237) || (c == 464) || (c == 236)) return true;

    // O, ō 333, ó 243, ǒ 466, ò 242
    if ((c == 333) || (c == 243) || (c == 466) || (c == 242)) return true;

    // U, ū 363, ú 250, ǔ 468, ù 249
    if ((c == 363) || (c == 250) || (c == 468) || (c == 249)) return true;

    // U, ǖ 470, ǘ 472, ǚ 474, ǜ 476
    if ((c == 470) || (c == 472) || (c == 474) || (c == 476)) return true;

    // *******
    // Capital vowels with tone symbols

    // Ā 256, Á 193, Ǎ 461, À 192
    if ((c == 256) || (c == 193) || (c == 461) || (c == 192)) return true;

    // Ē 274, É 201, Ě 282, È 200
    if ((c == 274) || (c == 201) || (c == 282) || (c == 200)) return true;

    // Ō 332, Ó 211, Ǒ 465, Ò 210
    if ((c == 332) || (c == 211) || (c == 465) || (c == 210)) return true;

    // Not sure about these - found them on a website
    // **********************************************
    //       1234 5678 9ABC DEF
    // A8A0  āáǎà ēéěè īíǐì ōóǒ
    //
    //       0 1234 5678 9 A
    // A8B0  ò ūúǔù ǖǘǚǜ ü ê
    // **********************************************
    if ((c >= 0xA8A1) && (c <= 0xA8Ba)) return true;

    return false;
-
isRegVowel
public static boolean isRegVowel(char c)
Checks that a character is a standard vowel.
- Parameters:
c - any UTF-8, ASCII or UniCode character available.
- Returns:
TRUE if the input character 'c' EQUALS one of these ten letters: a, e, i, o, u, A, E, I, O, U
- Code:
- Exact Method Body:
    // The normal vowels

    // a 97, A 65
    if ((c == 97) || (c == 65)) return true;

    // e 101, E 69
    if ((c == 101) || (c == 69)) return true;

    // i 105, I 73
    if ((c == 105) || (c == 73)) return true;

    // o 111, O 79
    if ((c == 111) || (c == 79)) return true;

    // u 117, U 85
    if ((c == 117) || (c == 85)) return true;

    return false;
-
isRegLetter
public static boolean isRegLetter(char c)
Regular Letters Include: 'A' ... 'Z' (65 - 90), 'a' ... 'z' (97 - 122)
- Parameters:
c - any UTF-8, ASCII or UniCode character available.
- Returns:
TRUE if the input character 'c' is any letter in lower-level ASCII (and not any of the AUC).
- Code:
- Exact Method Body:
    return ((c >= 65) && (c <= 90)) || ((c >= 97) && (c <= 122));
-
isNumber
public static boolean isNumber(char c)
Regular Numbers Include: '0' ... '9'
- Parameters:
c - any UTF-8, ASCII or UniCode character available.
- Returns:
TRUE if the input character 'c' is in the range of ASCII '0' ... '9' (not any of the AUC)
- Code:
- Exact Method Body:
    return ((c >= 48) && (c <= 57));
-
isSpace
public static boolean isSpace(char c)
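Checks for WhiteSpace: '\t', '\n', '\r', ' '
- Parameters:
c - any UTF-8, ASCII or UniCode character available.
- Returns:
TRUE if the input character 'c' is a whitespace character code from the above list. Note that the method body printed below compares against decimal 9, 12, 15 and 32. Decimal 12 is form-feed and 15 is shift-in, whereas '\n' is decimal 10 and '\r' is decimal 13, so the body as printed does not match the list above for those two characters.
- Code:
- Exact Method Body:
    return ((c == 9) || (c == 12) || (c == 15) || (c == 32));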
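For reference, the four characters named in the description have the decimal codes 9, 10, 13 and 32. The sketch below is a hypothetical helper, not the library's method; it tests against the char literals themselves so that no numeric codes need to be remembered.

```java
// Hypothetical check written with char literals instead of numeric codes.
final class WhitespaceCodesSketch
{
    // TRUE for tab, line-feed, carriage-return and space only
    static boolean isListedWhitespace(char c)
    {
        return (c == '\t') || (c == '\n') || (c == '\r') || (c == ' ');
    }

    public static void main(String[] args)
    {
        // Prints the decimal code of each of the four characters
        for (char c : new char[] { '\t', '\n', '\r', ' ' })
            System.out.println((int) c);
    }
}
```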
-
bulletListAUC
public static int bulletListAUC(char c)
Bullet List characters in upper UniCode / UTF-8. These characters exist in UTF-8, and they are occasionally used in documents found on Chinese News Websites. They are all "bullet-list" points. An integer is returned for each of these, that is equal to the number represented by the UTF-8/UniCode character here.
- 0 1 2 3 4 5 6 7 8 9 a b c d e f
- N ⒈ ⒉ ⒊ ⒋ ⒌ ⒍ ⒎ ⒏ ⒐ ⒑ ⒒ ⒓ ⒔ ⒕ ⒖
- ⒗ ⒘ ⒙ ⒚ ⒛ ⑴ ⑵ ⑶ ⑷ ⑸ ⑹ ⑺ ⑻ ⑼ ⑽ ⑾
- ⑿ ⒀ ⒁ ⒂ ⒃ ⒄ ⒅ ⒆ ⒇ ① ② ③ ④ ⑤ ⑥ ⑦
- ⑧ ⑨ ⑩ N N ㈠ ㈡ ㈢ ㈣ ㈤ ㈥ ㈦ ㈧ ㈨ ㈩ N
- N Ⅰ Ⅱ Ⅲ Ⅳ Ⅴ Ⅵ Ⅶ Ⅷ Ⅸ Ⅹ Ⅺ Ⅻ
- Parameters:
c - any character as input
- Returns:
- The number equivalent represented by this bullet point, or 0 if the character is not one of the bullet-list characters above.
- Code:
- Exact Method Body:
    // ⒈ ==> ⒛
    if ((c >= 0x2488) && (c <= 0x249B)) return ((int) c) - 0x2487;

    // ⑴ ==> ⒇
    if ((c >= 0x2474) && (c <= 0x2487)) return ((int) c) - 0x2473;

    // ① ==> ⑩
    if ((c >= 0x2460) && (c <= 0x2469)) return ((int) c) - 0x245F;

    // ㈠ ==> ㈩
    if ((c >= 0x3220) && (c <= 0x3229)) return ((int) c) - 0x321F;

    // Ⅰ ==> Ⅻ
    if ((c >= 0x2160) && (c <= 0x216B)) return ((int) c) - 0x215F;

    return 0;
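The arithmetic above relies on each run of bullet characters being consecutive in UniCode, so the number is the offset from the first character of the run plus one. A table-driven sketch of the same idea follows; the class name and the RANGES array are illustrative assumptions, and the range limits are the ones from the method body.

```java
// Hypothetical table-driven version of the range arithmetic shown above.
final class BulletNumberSketch
{
    // { first char, last char } of each run of consecutive bullet characters
    private static final int[][] RANGES = {
        { 0x2488, 0x249B },   // digit-full-stop forms,   1 to 20
        { 0x2474, 0x2487 },   // parenthesized numbers,   1 to 20
        { 0x2460, 0x2469 },   // circled numbers,         1 to 10
        { 0x3220, 0x3229 },   // parenthesized ideographs, 1 to 10
        { 0x2160, 0x216B }    // Roman numerals,          1 to 12
    };

    // Returns the number the bullet character stands for, or 0 if it is not one
    static int bulletNumber(char c)
    {
        for (int[] r : RANGES)
            if ((c >= r[0]) && (c <= r[1])) return c - r[0] + 1;

        return 0;
    }

    public static void main(String[] args)
    {
        System.out.println(bulletNumber('\u247F'));   // parenthesized twelve
        System.out.println(bulletNumber('x'));        // not a bullet character
    }
}
```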
-
alphaNumericAUC
public static char alphaNumericAUC(char c)
Alpha-Numeric character code from upper UniCode / UTF-8.
These characters exist in UTF-8, but they ARE NOT the usual ASCII characters for the letters 'A' ... 'Z' or the numbers '0' ... '9'. They are, however, sometimes found in documents on Chinese News Websites, etc.
Copied from:
http://www.khngai.com/chinese/charmap/tblgb.php?page=0
- 0 1 2 3 4 5 6 7 8 9 a b c d e f
-   ！ ＂ ＃ ￥ ％ ＆ ＇ （ ） ＊ ＋ ， － ． ／
- ０ １ ２ ３ ４ ５ ６ ７ ８ ９ ： ； ＜ ＝ ＞ ？
- ＠ Ａ Ｂ Ｃ Ｄ Ｅ Ｆ Ｇ Ｈ Ｉ Ｊ Ｋ Ｌ Ｍ Ｎ Ｏ
- Ｐ Ｑ Ｒ Ｓ Ｔ Ｕ Ｖ Ｗ Ｘ Ｙ Ｚ ［ ＼ ］ ＾ ＿
- ｀ ａ ｂ ｃ ｄ ｅ ｆ ｇ ｈ ｉ ｊ ｋ ｌ ｍ ｎ ｏ
- ｐ ｑ ｒ ｓ ｔ ｕ ｖ ｗ ｘ ｙ ｚ ｛ ｜ ｝ ￣
- Parameters:
c - any character as input
- Returns:
- the "lower-level-ASCII" version of that character, or ASCII-0 if the input is not one of these letters or numbers. Note that the digit range in the body below extends through 0xFF1A, so the full-width colon is also converted, to ':'.
- Code:
- Exact Method Body:
    // ASCII 'A' is 65
    if ((c > 0xFF20) && (c < 0xFF3B)) return (char) (65 + (c - 0xFF21));

    // ASCII 'a' is 97
    if ((c > 0xFF40) && (c < 0xFF5B)) return (char) (97 + (c - 0xFF41));

    // ASCII '0' is 48
    if ((c >= 0xFF10) && (c <= 0xFF1A)) return (char) (48 + (c - 0xFF10));

    return 0;
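The three offsets above are instances of one fact: the full-width forms 0xFF01 ... 0xFF5E sit at a constant distance of 0xFEE0 from ASCII 0x21 ... 0x7E. The sketch below applies that single offset to a whole String. It is a broader, hypothetical helper rather than the library's method: it also converts full-width punctuation, and it leaves all other characters unchanged instead of returning ASCII-0.

```java
// Hypothetical whole-String converter based on the constant 0xFEE0 offset.
final class FullWidthSketch
{
    // Converts full-width forms (0xFF01 ... 0xFF5E) to ASCII; copies everything else as-is
    static String toAscii(String s)
    {
        StringBuilder sb = new StringBuilder(s.length());

        for (int i = 0; i < s.length(); i++)
        {
            char c = s.charAt(i);

            if ((c >= 0xFF01) && (c <= 0xFF5E)) sb.append((char) (c - 0xFEE0));
            else                                sb.append(c);
        }

        return sb.toString();
    }

    public static void main(String[] args)
    {
        // Full-width 'A', 'b' and '3', written as escapes
        System.out.println(toAscii("\uFF21\uFF42\uFF13"));
    }
}
```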
-
punctuationAUC
public static char punctuationAUC(char c)
This method, punctuationAUC(char), converts any characters which are common on many Mandarin Chinese websites into a lower-level, more typical/normal ASCII equivalent. This can be very useful when trying to make sense of brackets, parentheses, quotes, commas and other punctuation marks, and to quickly convert them into a simple version of the character.
If the input character has an "Alternate Version" in the lower-level-ASCII range, that lower level ASCII character is returned. If this isn't AUC, ASCII-0 is returned.
For Instance:
Input                      Output
〖 〗 【 】 ［ ］          [ ]
。 ○ ● ．                  . (ASCII-period)
¨ 〃 “ ” ″ ＂              " (ASCII-double-quote)
, (ASCII-comma)            ASCII-0
+ (ASCII-plus)             ASCII-0
- Parameters:
c - any character as input
- Returns:
- the "lower-level-ASCII" version of that character
NOTE: ASCII-0 is returned if this is not a valid "AUC" UTF-8 / UniCode code!
- Code:
- Exact Method Body:
    // Copied from:
    // *** http://www.khngai.com/chinese/charmap/tblgb.php?page=0
    //
    // 0 1 2 3 4 5 6 7 8 9 a b c d e f
    // N N 、 。 · ˉ ˇ ¨ 〃 々 — ~ ‖ … ‘ ’
    // “ ” 〔 〕 〈 〉 《 》 「 」 『 』 〖 〗 【 】
    // ± × ÷ ∶ ∧ ∨ ∑ ∏ ∪ ∩ ∈ ∷ √ ⊥ ∥ ∠
    // ⌒ ⊙ ∫ ∮ ≡ ≌ ≈ ∽ ∝ ≠ ≮ ≯ ≤ ≥ ∞ ∵
    // ∴ ♂ ♀ ° ′ ″ ℃ $ ¤ ¢ £ ‰ § № ☆ ★
    // ○ ● ◎ ◇ ◆ □ ■ △ ▲ ※ → ← ↑ ↓ 〓
    //
    // 0 1 2 3 4 5 6 7 8 9 a b c d e f
    //   ！ ＂ ＃ ￥ ％ ＆ ＇ （ ） ＊ ＋ ， － ． ／
    // ０ １ ２ ３ ４ ５ ６ ７ ８ ９ ： ； ＜ ＝ ＞ ？
    // ＠ Ａ Ｂ Ｃ Ｄ Ｅ Ｆ Ｇ Ｈ Ｉ Ｊ Ｋ Ｌ Ｍ Ｎ Ｏ
    // Ｐ Ｑ Ｒ Ｓ Ｔ Ｕ Ｖ Ｗ Ｘ Ｙ Ｚ ［ ＼ ］ ＾ ＿
    // ｀ ａ ｂ ｃ ｄ ｅ ｆ ｇ ｈ ｉ ｊ ｋ ｌ ｍ ｎ ｏ
    // ｐ ｑ ｒ ｓ ｔ ｕ ｖ ｗ ｘ ｙ ｚ ｛ ｜ ｝ ￣

    switch (c)
    {
        // 、 ，
        case 0x3001:                // 、
        case 0xFF0C: return ',';    // ，

        // 。 ○ ● ．
        case 0x3002:                // 。
        case 0x25CB:                // ○
        case 0x25CF:                // ●
        case 0xFF0E: return '.';    // ．

        // ‘ ’ ′ ＇ ｀
        case 0x2018:                // ‘
        case 0x2019:                // ’
        case 0x2032:                // ′
        case 0xFF07:                // ＇
        case 0xFF40: return '\'';   // ｀

        // ¨ 〃 “ ” ″ ＂
        case 0x00A8:                // ¨
        case 0x3003:                // 〃
        case 0x201C:                // “
        case 0x201D:                // ”
        case 0x2033:                // ″
        case 0xFF02: return '\"';   // ＂

        // 〔 （
        case 0x3014:                // 〔
        case 0xFF08: return '(';    // （

        // 〕 ）
        case 0x3015:                // 〕
        case 0xFF09: return ')';    // ）

        // 〈 ＜
        case 0x3008:                // 〈
        case 0xFF1C: return '<';    // ＜

        // 〉 ＞
        case 0x3009:                // 〉
        case 0xFF1E: return '>';    // ＞

        // 「 『 〖 【 ［
        case 0x300C:                // 「
        case 0x300E:                // 『
        case 0x3016:                // 〖
        case 0x3010:                // 【
        case 0xFF3B: return '[';    // ［

        // 」 』 〗 】 ］
        case 0x300D:                // 」
        case 0x300F:                // 』
        case 0x3017:                // 〗
        case 0x3011:                // 】
        case 0xFF3D: return ']';    // ］

        // ∶ ：
        case 0x2236:                // ∶
        case 0xFF1A: return ':';    // ：

        case 0xFF01: return '!';    // ！
        case 0xFF03: return '#';    // ＃
        case 0xFF05: return '%';    // ％
        case 0xFF06: return '&';    // ＆
        case 0xFF1F: return '?';    // ？
        case 0xFF0F: return '/';    // ／
        case 0xFF3E: return '^';    // ＾
        case 0xFF5B: return '{';    // ｛
        case 0xFF5D: return '}';    // ｝
        case 0xFF5C: return '|';    // ｜
        case 0xFF0B: return '+';    // ＋
        case 0xFF3C: return '\\';   // ＼
        case 0xFF3F: return '_';    // ＿

        // — －
        case 0x2014:                // —
        case 0xFF0D: return '-';    // －

        // 〓 ＝
        case 0x3013:                // 〓
        case 0xFF1D: return '=';    // ＝
    }

    return 0;
-
isBPMFAUC
public static boolean isBPMFAUC(char c)
Bo Po Mo Fo (注音符號).
This is a popular pronunciation system for Mandarin Characters in Taiwan & Hong Kong.
- N N N N N ㄅ ㄆ ㄇ ㄈ ㄉ ㄊ ㄋ ㄌ ㄍ ㄎ ㄏ
- ㄐ ㄑ ㄒ ㄓ ㄔ ㄕ ㄖ ㄗ ㄘ ㄙ ㄚ ㄛ ㄜ ㄝ ㄞ ㄟ
- ㄠ ㄡ ㄢ ㄣ ㄤ ㄥ ㄦ ㄧ ㄨ ㄩ N N N N N N
- Parameters:
c - any UTF-8, ASCII or UniCode character available from Plane 0, the Basic Multi-Lingual Plane
- Returns:
TRUE if the input character 'c' is in this UTF-8/UniCode range. The HEXADECIMAL / UTF-8 representation of the 'Bo Po Mo Fo' range is: 0x3110 ... 0x3129. The characters in the first table row (ㄅ ... ㄏ, 0x3105 ... 0x310F) lie below this range, so FALSE is returned for them.
- Code:
- Exact Method Body:
    // 0 1 2 3 4 5 6 7 8 9 a b c d e f
    // N N N N N ㄅ ㄆ ㄇ ㄈ ㄉ ㄊ ㄋ ㄌ ㄍ ㄎ ㄏ
    // ㄐ ㄑ ㄒ ㄓ ㄔ ㄕ ㄖ ㄗ ㄘ ㄙ ㄚ ㄛ ㄜ ㄝ ㄞ ㄟ
    // ㄠ ㄡ ㄢ ㄣ ㄤ ㄥ ㄦ ㄧ ㄨ ㄩ N N N N N N
    return (c >= 0x3110) && (c <= 0x3129);
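The JDK also exposes this region as a named Unicode block. The sketch below is an alternative, hypothetical check rather than the library's method: Character.UnicodeBlock.BOPOMOFO covers the whole block 0x3100 ... 0x312F, which is wider than the 0x3110 ... 0x3129 range used above.

```java
// Hypothetical block-based check; covers 0x3100 ... 0x312F, not just 0x3110 ... 0x3129.
final class BopomofoSketch
{
    // TRUE when the char lies in the Unicode "Bopomofo" block
    static boolean isBopomofo(char c)
    {
        return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.BOPOMOFO;
    }

    public static void main(String[] args)
    {
        System.out.println(isBopomofo('\u3105'));   // first letter of the block
        System.out.println(isBopomofo('A'));        // a Latin letter
    }
}
```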
-
endOfSentenceAUC
public static char endOfSentenceAUC(char c)
Checks for end-of-sentence punctuation marks - and "down-converts" them to the simple ASCII equivalent version of that punctuation mark. If the input character code is not an AUC version of a typical Mandarin-Chinese end-of-sentence punctuation mark - then ASCII-zero is returned.
NOTE: if a lower-level-ASCII (normal) punctuation mark is input - then ASCII-0 is returned.
SPECIFICALLY: with '.' '?' and '!' as input to this function, ASCII-0 will be returned.
USE: endOfSentence(c) to have those punctuation marks included in non-zero results.
- Parameters:
c - any UTF-8, ASCII or UniCode character available.
- Returns:
- if the input character 'c' is an "alternate UTF-8" version of the punctuation marks:
- a period ('.')
- an exclamation-point ('!')
- a question-mark ('?')
Then the output to this method shall be determined by the table below:
Input Character    Output Character
。 ○ ● ．          '.' (normal period)
！                 '!' (regular exclamation point)
？                 '?' (usual question mark)
NOTE: If the normal period, question, or exclamation are passed as input to this function, this function will return ASCII-0
- See Also:
endOfSentence(char)
- Code:
- Exact Method Body:
    char auc = punctuationAUC(c);
    if (auc != 0) c = auc;

    // A 'switch' is used instead of an 'if' with a char-cast because it is easier to
    // read on this page. Only the three characters with ASCII 46, 33, and 63 should
    // return non-zero values.
    switch ((int) auc)
    {
        // These characters identify an "End of Sentence" marker.
        case 0x2E: return '.'; // DEC: 46
        case 0x21: return '!'; // DEC: 33
        case 0x3F: return '?'; // DEC: 63

        // All other characters should result in a '0'
        default: return (char) 0;
    }
-
endOfSentence
public static char endOfSentence(char c)
Checks for end-of-sentence punctuation marks. This Helper function is *almost* identical to the endOfSentenceAUC(c) method. endOfSentenceAUC(c) returns ASCII-0 for the usual punctuation marks '.', '!' and '?'. endOfSentence(c) does not 'leave-out' or 'deny' these lower-level-ASCII punctuation symbols.
- Parameters:
c - any UTF-8, ASCII or UniCode character available.
- Returns:
- If the input character 'c' is a period ('.'), an exclamation-point ('!'), or a question-mark ('?') - or an AUC version of that punctuation - then that punctuation is returned. Otherwise ASCII-0 is returned.
- See Also:
endOfSentenceAUC(char)
- Code:
- Exact Method Body:
    char auc = endOfSentenceAUC(c);
    if (auc != 0) c = auc;

    // These three characters identify an "End of Sentence" Marker
    if ((c == '.') || (c == '!') || (c == '?')) return c;

    return (char) 0;
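One typical use of an end-of-sentence test is splitting text into sentences. The sketch below is a self-contained illustration under stated assumptions, not library code: the class name and the END_MARKS string are hypothetical, and END_MARKS holds only the three ASCII marks plus the ideographic full stop and the full-width exclamation and question marks, which is a subset of what endOfSentence accepts.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sentence splitter built on a small set of end-of-sentence marks.
final class SentenceSplitSketch
{
    // ASCII . ! ? followed by U+3002, U+FF01 and U+FF1F
    private static final String END_MARKS = ".!?\u3002\uFF01\uFF1F";

    // Cuts the text after every end mark; a trailing piece without a mark is kept too
    static List<String> split(String text)
    {
        List<String> sentences = new ArrayList<>();
        int start = 0;

        for (int i = 0; i < text.length(); i++)
            if (END_MARKS.indexOf(text.charAt(i)) >= 0)
            {
                sentences.add(text.substring(start, i + 1));
                start = i + 1;
            }

        if (start < text.length()) sentences.add(text.substring(start));

        return sentences;
    }

    public static void main(String[] args)
    {
        // Two short Chinese sentences, written as escapes
        System.out.println(split("\u4F60\u597D\u3002\u518D\u89C1\uFF01").size());
    }
}
```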
-
endOfPhraseAUC
public static char endOfPhraseAUC(char c)
Checks for end-of-phrase punctuation marks - and "down-converts" them to the simple ASCII equivalent version of that punctuation mark. If the input character code is not an AUC version of a typical Mandarin-Chinese phrase-delimiting punctuation mark - then ASCII-zero is returned.
NOTE: if a lower-level-ASCII (normal) punctuation mark is input - then ASCII-0 is returned.
SPECIFICALLY: with ',' ':' ';' and other common phrase-ending marks in Mandarin as input to this function, ASCII-0 will be returned.
USE: endOfPhrase(c) to have those punctuation marks included in non-zero results.
- Parameters:
c - any UTF-8, ASCII or UniCode character available.
- Returns:
- if the input character 'c' is an "alternate UTF-8" (AUC) version of the punctuation marks:
Punctuation      Symbol    ASCII-Code
semi-colon       ';'       HEX: 0x3B, DEC: 59
comma            ','       HEX: 0x2C, DEC: 44
colon            ':'       HEX: 0x3A, DEC: 58
double-quote     '\"'      HEX: 0x22, DEC: 34
single-quote     '\''      HEX: 0x27, DEC: 39
left-bracket     '['       HEX: 0x5B, DEC: 91
right-bracket    ']'       HEX: 0x5D, DEC: 93
less-than        '<'       HEX: 0x3C, DEC: 60
greater-than     '>'       HEX: 0x3E, DEC: 62
left-paren       '('       HEX: 0x28, DEC: 40
right-paren      ')'       HEX: 0x29, DEC: 41
IMPORTANT NOTE: *only* the upper-level-UTF-8/UniCode versions of these punctuation marks will produce a non-zero result. An actual ASCII comma, semi-colon, quote, bracket, or parenthesis (etc...) will cause this method to return ASCII-0. Please use endOfPhrase(char) to include the lower-level (already down-converted ASCII) with non-zero results.
- See Also:
endOfPhrase(char)
- Code:
- Exact Method Body:
    char auc = punctuationAUC(c);
    if (auc != 0) c = auc;

    // A 'switch' is used instead of an 'if' with a char-cast because it is easier to
    // read on this page. Only the characters having ASCII 59, 44, 58, 34, etc... should
    // return non-zero values.
    switch ((int) auc)
    {
        // These characters constitute an "End of Phrase" marker
        case 0x3B: return ';';  // DEC: 59
        case 0x2C: return ',';  // DEC: 44
        case 0x3A: return ':';  // DEC: 58
        case 0x22: return '\"'; // DEC: 34
        case 0x27: return '\''; // DEC: 39
        case 0x5B: return '[';  // DEC: 91
        case 0x5D: return ']';  // DEC: 93
        case 0x3C: return '<';  // DEC: 60
        case 0x3E: return '>';  // DEC: 62
        case 0x28: return '(';  // DEC: 40
        case 0x29: return ')';  // DEC: 41

        // All other results should return '0'
        default: return 0;
    }
-
endOfPhrase
public static char endOfPhrase(char c)
endOfPhrase - any version of the end-of-phrase markers usually used in Mandarin Chinese text. This method returns the exact same results as the endOfPhraseAUC(char) method.
EXCEPT: The regular/normal version of that punctuation mark (ASCII for semi-colon, comma, quote, etc...) will return the exact-same semi-colon, comma or quote, instead of ASCII-0.
Input & Method Called     Result
endOfPhrase(';')          ';'    // Normal ASCII semi-colon symbol
endOfPhraseAUC(';')       0      // ASCII-0 returned
endOfPhrase('】')         ']'    // right-bracket returned
endOfPhraseAUC('】')      ']'    // right-bracket returned
endOfPhrase(']')          ']'    // right-bracket returned
endOfPhraseAUC(']')       0      // ASCII-0 returned
The list of end-of-phrase characters includes the following:
';' ',' ':' '\"' '\'' '[' ']' '<' '>' '(' ')'
- Parameters:
c - Any character in the entire UniCode range. 0x0000 to 0xFFFF
- Returns:
- If 'c' is an "AUC" version of an end-of-phrase marker - or a regular lower-level ASCII version - then that punctuation mark is returned. Otherwise 0 is returned.
- See Also:
punctuationAUC(char)
- Code:
- Exact Method Body:
    char auc = punctuationAUC(c);
    if (auc != 0) c = auc;

    if (    (c == ';') || (c == ',') || (c == ':') || (c == '\"') || (c == '\'')
        ||  (c == '[') || (c == ']') || (c == '<') || (c == '>')  || (c == '(')
        ||  (c == ')')
    )
        return c;

    return (char) 0;
-
quoteAUC
public static char quoteAUC(char c)
Quotes - any version. AUC or normal-ASCII, (BOTH) single or double quote.
- Parameters:
c - Any character in the entire UniCode range, 0x0000 to 0xFFFF, which is the Basic Multi Lingual Plane.
- Returns:
- If the input character 'c' is an "AUC" version of the single (or double) quote, or the regular-ASCII single/double quote, then the appropriate single or double-quote is returned. Otherwise 0 is returned.
- See Also:
punctuationAUC(char)
- Code:
- Exact Method Body:
    char auc = punctuationAUC(c);
    if (auc != 0) c = auc;

    switch ((int) c)
    {
        case 0x22: return '\"'; // DEC: 34
        case 0x27: return '\''; // DEC: 39
        default: return (char) 0;
    }
-
commaAUC
public static char commaAUC(char c)
Comma - any version. AUC or normal-ASCII, (BOTH) comma
- Parameters:
c - Any character in the entire UTF-8 range, 0x0000 to 0xFFFF, the Basic Multi-Lingual Plane.
- Returns:
- If the input character 'c' is an "AUC" version of the comma, or the regular-ASCII comma, then the comma is returned. Otherwise 0 is returned.
- See Also:
punctuationAUC(char)
- Code:
- Exact Method Body:
    char auc = punctuationAUC(c);
    if (auc != 0) c = auc;

    switch ((int) c)
    {
        case 0x2c: return ','; // DEC: 44
        default: return (char) 0;
    }
-
bracketAUC
public static char bracketAUC(char c)
Brackets - any version. AUC or normal-ASCII brackets.
- Parameters:
c
- Any character in the entire UniCode range, 0x0000 to 0xFFFF.
- Returns:
- If the input character 'c' is an "AUC" version of a bracket, or a regular-ASCII bracket, then the appropriate bracket is returned. Otherwise 0 is returned.
- See Also:
punctuationAUC(char)
- Code:
- Exact Method Body:
char auc = punctuationAUC(c);
if (auc != 0) c = auc;

switch ((int) c)
{
    case 0x5B: return '['; // DEC: 91
    case 0x5D: return ']'; // DEC: 93
    case 0x3C: return '<'; // DEC: 60
    case 0x3E: return '>'; // DEC: 62
    default:   return (char) 0;
}
-
parenAUC
public static char parenAUC(char c)
Parenthesis - any version. AUC or normal-ASCII parenthesis.
- Parameters:
c
- Any character in the entire UniCode range, 0x0000 to 0xFFFF.
- Returns:
- If the input character 'c' is an "AUC" version of a parenthesis, or a regular-ASCII parenthesis, then the appropriate parenthesis is returned. Otherwise 0 is returned.
- See Also:
punctuationAUC(char)
- Code:
- Exact Method Body:
char auc = punctuationAUC(c);
if (auc != 0) c = auc;

switch ((int) c)
{
    case 0x28: return '('; // DEC: 40
    case 0x29: return ')'; // DEC: 41
    default:   return (char) 0;
}
-
testAUC
public static java.lang.String testAUC()
- Returns:
- An HTML <TABLE> that contains many tests of the subroutines in this class
- Code:
- Exact Method Body:
StringBuilder ret = new StringBuilder();

ret.append(
    "<TABLE BORDER=\"1\"><TR>" +
    "<TD WIDTH=\"30\"> </TD>" +
    "<TD WIDTH=\"70\"> </TD>" +
    "<TD WIDTH=\"70\"> </TD>" +
    "<TD WIDTH=\"30\"> </TD>"
);

for (int i=4; i < 12; i++) ret.append("<TD WIDTH=\"70\"> </TD>");

ret.append("</TR>");

for (int i=0; i < AUC.length(); i++)
{
    char c = AUC.charAt(i);
    if (c == ' ') continue;

    // Check original character (not punctuation-converted cc)
    char bl = Integer.toString(bulletListAUC(c)).charAt(0);
    boolean bpmf = isBPMFAUC(c);

    // first, convert the punctuation to normal-ASCII punctuation
    // These are the "translated" characters
    // The "translated character" is where, for example '〗' ==> ']'
    char newC = punctuationAUC(c);

    // These are used for building <TABLE> & <TD> entry strings
    char q = quoteAUC(newC);
    char es = endOfSentenceAUC(newC);
    char ep = endOfPhraseAUC(newC);
    char com = commaAUC(newC);
    char br = bracketAUC(newC);
    char p = parenAUC(newC);

    char ascii = punctuationAUC(c);
    if (ascii == 0) ascii = alphaNumericAUC(c);
    if (bl != 0) ascii = bl;
    if (bpmf) ascii = c;
    if (ascii == 0) ascii = 'x';

    // =================================================
    // This is for debugging this test function
    String tmp =
        " newCC = " + newC + ", q=" + q + ", es=" + es + ", ep=" + ep + ", com=" + com +
        ", br=" + br + ", p=" + p + ", bl =" + bl + ", bpmf=" + bpmf;

    tmp = tmp.replaceAll("<", "&lt;").replaceAll(">", "&gt;");

    // Build the HTML Table
    ret.append("<TR>");
    ret.append("<TD>" + c + "</TD>");
    ret.append("<TD>" + ((int) c) + "</TD>");
    ret.append("<TD>" + "0x" + String.format("%x",(int) c).toUpperCase() + "</TD>");
    ret.append("<TD>" + ascii + "</TD>");
    ret.append("<TD>" + ((q == 0) ? "" : "Quote") + "</TD>");
    ret.append("<TD>" + ((es == 0) ? "" : "Sentence") + "</TD>");
    ret.append("<TD>" + ((ep == 0) ? "" : "Phrase") + "</TD>");
    ret.append("<TD>" + ((com == 0) ? "" : "Comma") + "</TD>");
    ret.append("<TD>" + ((br == 0) ? "" : "Bracket") + "</TD>");
    ret.append("<TD>" + ((p == 0) ? "" : "Paren") + "</TD>");
    ret.append("<TD>" + ((bl == 0) ? "" : "Bullet") + "</TD>");
    ret.append("<TD>" + (bpmf ? "BPMF" : "") + "</TD>");

    // ==========================================================
    // Un-Comment this if you want to debug this print function
    // outStr += "</TR><TR><TD COLSPAN=\"12\">" + tmp + "</TD></TR>";
}

ret.append("</TABLE>");
return ret.toString();
-
countLeadingLettersAndNumbers
public static int countLeadingLettersAndNumbers (java.lang.String chineseSentence)
Checks for any leading alphabetic ('a' ... 'z') and numeric ('0' ... '9') characters in a Chinese String.
CHANGED: 2018.09.24 - Commas and periods are left in the String (when situated between digits). These are considered to be part of the "Leading Letters and Numbers".
- Parameters:
chineseSentence
- A sentence that may or may not have leading letters & numbers.
- Returns:
- The String-index of the first non-alphabetic, non-numeric character in the String.
NOTE: White-space does not count as a letter or number. If white-space is the first such character in this String, the position of that white-space character is returned.
- See Also:
isAlphaNumeric(char)
- Code:
- Exact Method Body:
for (int i = 0; i < chineseSentence.length(); i++)
{
    char c = chineseSentence.charAt(i);
    if ((! isAlphaNumeric(c)) && (c != '.') && (c != ',')) return i;
}

return chineseSentence.length(); // This really ought not to happen, but just in case....
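To see what index the loop above produces, here is a small self-contained sketch. `LeadingCountSketch` and `isAlphaNumericStandIn` are hypothetical names; the stand-in assumes that only ASCII letters and digits count, which may differ from the library's own `isAlphaNumeric(char)`.

```java
final class LeadingCountSketch
{
    // Assumption: a stand-in for ZH.isAlphaNumeric(char) that accepts only
    // ASCII letters and digits.
    static boolean isAlphaNumericStandIn(char c)
    {
        return ((c >= 'a') && (c <= 'z')) || ((c >= 'A') && (c <= 'Z')) || ((c >= '0') && (c <= '9'));
    }

    // Mirrors the loop above: stop at the first character that is neither
    // alpha-numeric, nor a period, nor a comma.
    static int countLeading(String s)
    {
        for (int i = 0; i < s.length(); i++)
        {
            char c = s.charAt(i);
            if ((! isAlphaNumericStandIn(c)) && (c != '.') && (c != ',')) return i;
        }

        return s.length();
    }

    public static void main(String[] args)
    {
        System.out.println(countLeading("2018\u5E74"));       // prints 4
        System.out.println(countLeading("1,120.5 \u5143"));   // prints 7 (stops at the space)
        System.out.println(countLeading("\u4F60\u597D"));     // prints 0
    }
}
```

As in the method body above, the loop accepts '.' and ',' anywhere in the leading run; it does not itself check that they sit between two digits.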
-
convertAnyAUC
public static java.lang.String convertAnyAUC(java.lang.String s)
Checks for higher-Unicode letters and numbers, and converts them into lower-level versions of the appropriate letter or number.
SPECIFICALLY: This method is just a "for-loop" which makes a call to alphaNumericAUC() for each character. If zero is not returned from that method-call, then the character at that index, which contained such a higher-UniCode letter or number, is replaced in the returned String.
- Parameters:
s
- This may or may not have "Alternate UniCode" characters for letters and numbers.
- Returns:
- A copy of the input String in which any "alternate" versions of 'A' ... 'Z' or '0' ... '9' have been replaced by the regular versions.
- See Also:
alphaNumericAUC(char)
- Code:
- Exact Method Body:
char[] cArr = s.toCharArray();

for (int i = 0; i < cArr.length; i++)
{
    char auc = alphaNumericAUC(cArr[i]);
    if (auc != 0) cArr[i] = auc;
}

return new String(cArr);
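The conversion loop above can be illustrated with one well-known family of alternate characters. This is a sketch under an assumption: `alphaNumericStandIn` is a hypothetical stand-in that covers only the full-width digits and letters (U+FF10 to U+FF19, U+FF21 to U+FF3A, U+FF41 to U+FF5A); the set actually handled by `alphaNumericAUC(char)` is not shown in this section and may be larger.

```java
final class FullWidthSketch
{
    // Assumption: stand-in for alphaNumericAUC(char), covering only the
    // full-width digits, upper-case and lower-case letters.
    static char alphaNumericStandIn(char c)
    {
        boolean digit = (c >= '\uFF10') && (c <= '\uFF19');
        boolean upper = (c >= '\uFF21') && (c <= '\uFF3A');
        boolean lower = (c >= '\uFF41') && (c <= '\uFF5A');

        // Full-width forms sit exactly 0xFEE0 above their ASCII counterparts.
        return (digit || upper || lower) ? (char) (c - 0xFEE0) : (char) 0;
    }

    // Same loop as the method body above, using the stand-in.
    static String convert(String s)
    {
        char[] cArr = s.toCharArray();

        for (int i = 0; i < cArr.length; i++)
        {
            char auc = alphaNumericStandIn(cArr[i]);
            if (auc != 0) cArr[i] = auc;
        }

        return new String(cArr);
    }

    public static void main(String[] args)
    {
        // Full-width "AB1" followed by plain " ok"
        System.out.println(convert("\uFF21\uFF22\uFF11 ok"));   // prints AB1 ok
    }
}
```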
-
countSyllablesAndNonChinese
public static int countSyllablesAndNonChinese(java.lang.String word, java.lang.Appendable DOUT) throws java.io.IOException
Counts syllables in a "word" of PinYin. The input String is expected to not have any spaces!
NOTE: The number of syllables in a Chinese PinYin "word" identifies the number of Chinese Characters that were used to generate the input PinYin String.
CHANGED: 2018.09.24 - Added a test for periods and commas that are situated directly between two digits. In the String "5.0" the period between 5 and 0 is no longer removed!
If the String "5.0" were passed as the "word" parameter, the result should be 3!
- Parameters:
word
- A word in the "PinYin" format. (罗马拼音)
DOUT
- This must implement java.lang.Appendable. Debugging output is appended to it.
- Returns:
- The number of syllables (specifically: Chinese Characters) in the input word.
- Throws:
java.io.IOException
- The interface java.lang.Appendable mandates that IOException be treated as a checked exception for all output operations. Therefore IOException is a required exception in this method's throws clause.
- Code:
- Exact Method Body:
int numChinese = 0;

// Tone-Vowels & Numbers always correspond to a character
for (int letter = 0; letter < word.length(); letter++)
{
    char c = word.charAt(letter);
    if ( ZH.isToneVowel(c) || ZH.isNumber(c) || (c == '.') || (c == ',') ) numChinese++;
}

// Checks for vowel-strings that don't contain a tone
// ==> Checks for "clear tone"
String copyW = "" + word;
DOUT.append("[" + copyW + "] - ");

for (int letterIndex = 0; letterIndex < copyW.length(); letterIndex++)
    if (    ! ZH.isRegVowel(copyW.charAt(letterIndex))
        &&  ! ZH.isToneVowel(copyW.charAt(letterIndex))
    )
        copyW = StringParse.setChar(copyW, letterIndex, ' ');

DOUT.append("after erasing non-vowels [" + copyW + "]\n");

String[] syllables = copyW.trim().split(" ");

DOUT.append("Syllables are:");

for (int sylIndex = 0; sylIndex < syllables.length; sylIndex++ )
    DOUT.append("[" + syllables[sylIndex] + "]");

DOUT.append("\n");

TOP:
for (int sylIndex = 0; sylIndex < syllables.length; sylIndex++)
{
    String syllable = syllables[sylIndex].trim();
    boolean foundTone = false;

    // The split(' ') function sometimes provides blanks
    if (syllable.length() == 0) continue TOP;

    for (int vowelIndex = 0; vowelIndex < syllable.length(); vowelIndex++)
        if (ZH.isToneVowel(syllable.charAt(vowelIndex)))
            continue TOP;

    numChinese++;
    DOUT.append("NOTE: *** FOUND CLEAR TONE\n");
}

return numChinese;
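The two passes above (count tone-marked vowels and digits, then count vowel runs that carry no tone mark) can be sketched without the debugging output. Everything named here is hypothetical: `SyllableSketch`, `isTone` and `isReg` are assumed stand-ins for `ZH.isToneVowel` and `ZH.isRegVowel`, and the vowel lists are an assumption about what those methods accept.

```java
final class SyllableSketch
{
    // Assumption: the tone-marked vowels are a, e, i, o, u and u-umlaut,
    // four tones each, written here as UniCode escapes.
    static final String TONE_VOWELS =
        "\u0101\u00E1\u01CE\u00E0" + "\u0113\u00E9\u011B\u00E8" +
        "\u012B\u00ED\u01D0\u00EC" + "\u014D\u00F3\u01D2\u00F2" +
        "\u016B\u00FA\u01D4\u00F9" + "\u01D6\u01D8\u01DA\u01DC";

    // Assumption: the un-marked vowels, including u-umlaut.
    static final String REG_VOWELS = "aeiou\u00FC";

    static boolean isTone(char c) { return TONE_VOWELS.indexOf(c) >= 0; }
    static boolean isReg(char c)  { return REG_VOWELS.indexOf(c) >= 0; }

    static int countSyllables(String word)
    {
        int count = 0;

        // Pass 1: each tone-marked vowel, digit, period or comma counts once.
        for (char c : word.toCharArray())
            if (isTone(c) || ((c >= '0') && (c <= '9')) || (c == '.') || (c == ',')) count++;

        // Pass 2: blank out the non-vowels, leaving runs of vowels.
        StringBuilder sb = new StringBuilder(word);
        for (int i = 0; i < sb.length(); i++)
            if ((! isTone(sb.charAt(i))) && (! isReg(sb.charAt(i)))) sb.setCharAt(i, ' ');

        // A vowel run without any tone mark is a "clear tone" syllable.
        for (String run : sb.toString().trim().split(" "))
        {
            if (run.isEmpty()) continue;

            boolean toned = false;
            for (char c : run.toCharArray()) if (isTone(c)) toned = true;

            if (! toned) count++;
        }

        return count;
    }

    public static void main(String[] args)
    {
        System.out.println(countSyllables("n\u01D0h\u01CEo"));   // ni-hao, prints 2
        System.out.println(countSyllables("m\u0101ma"));         // ma-ma, prints 2
        System.out.println(countSyllables("5.0"));               // prints 3
    }
}
```

The last call reproduces the "5.0" case described above: two digits and one period give a result of 3.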
-
delAllPunctuationCHINESE
public static java.lang.String delAllPunctuationCHINESE (java.lang.String s)
Deletes all punctuation & non-character symbols. The String that is returned will be shortened by precisely the number of punctuation characters that were contained in that String.
NOTE: '.' and ',' (periods and commas) between numbers/digits are not removed!
- Parameters:
s
- An input String (in Mandarin - 普通话)
- Returns:
- A String that is the same as the input String - after skipping characters as follows:
if (isChinese(c) || isAlphaNumeric(c) || (alphaNumericAUC(c) != 0)) continue;
(else) s = StringParse.delChar(s, chr--);
- Code:
- Exact Method Body:
char[] cArr = s.toCharArray();
int sourcePos = 0;
int destPos = 0;

while (sourcePos < cArr.length)
{
    char c = cArr[sourcePos];

    // Check for things like 5.0 or 1,120,987 - SPECIFICALLY Comma's and Period's situated
    // directly between 2 numbers.
    if (    ((c == '.') || (c == ','))
        &&  (((sourcePos-1) == -1) || isNumber(cArr[sourcePos-1]))
        &&  (((sourcePos+1) == s.length()) || isNumber(cArr[sourcePos+1]))
    )
    {
        cArr[destPos++] = cArr[sourcePos++];
        continue;
    }

    // AUC were converted before calling this function ... (alphaNumericAUC(c) != 0))
    if (isChinese(c) || isAlphaNumeric(c))
    {
        cArr[destPos++] = cArr[sourcePos++];
        continue;
    }

    sourcePos++;
}

return s;
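The description above says the returned String is shortened, while the method body as listed ends with `return s;`. The sketch below shows one way the same in-place compaction could hand back the shortened text, by building the result from the first `destPos` characters of the array. It is an assumption about the intent, not the library's code; `CompactSketch` and its three predicates are hypothetical stand-ins for `isChinese`, `isNumber` and `isAlphaNumeric`.

```java
final class CompactSketch
{
    // Assumption: "Chinese" means the CJK Unified Ideographs block only.
    static boolean isChinese(char c) { return (c >= '\u4E00') && (c <= '\u9FFF'); }

    static boolean isNumber(char c) { return (c >= '0') && (c <= '9'); }

    // Assumption: ASCII letters and digits only.
    static boolean isAlphaNumeric(char c)
    {
        return isNumber(c) || ((c >= 'a') && (c <= 'z')) || ((c >= 'A') && (c <= 'Z'));
    }

    static String delPunctuation(String s)
    {
        char[] cArr = s.toCharArray();
        int destPos = 0;

        for (int sourcePos = 0; sourcePos < cArr.length; sourcePos++)
        {
            char c = cArr[sourcePos];

            // Periods and commas survive when a digit (or the String edge) is on each side.
            boolean betweenDigits =
                    ((c == '.') || (c == ','))
                &&  ((sourcePos == 0) || isNumber(cArr[sourcePos - 1]))
                &&  ((sourcePos + 1 == cArr.length) || isNumber(cArr[sourcePos + 1]));

            if (betweenDigits || isChinese(c) || isAlphaNumeric(c)) cArr[destPos++] = c;
        }

        // Only the first destPos characters were kept.
        return new String(cArr, 0, destPos);
    }

    public static void main(String[] args)
    {
        // "5.0" + one Chinese character + ideographic full stop
        System.out.println(delPunctuation("5.0\u5143\u3002").length());   // prints 4
    }
}
```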
-
delAllPunctuationPINYIN
public static java.lang.String delAllPunctuationPINYIN(java.lang.String s)
Deletes all punctuation & non-character symbols from a String of PinYin. The returned String will have the same length as it originally did, but the locations where punctuation existed will have been replaced with a space character.
NOTE: '.' and ',' (periods and commas) between numbers/digits are not removed!
- Parameters:
s
- An input String in 罗马拼音
- Returns:
- A String that is the same as the input String - after skipping characters as follows:
if (isAlphaNumeric(c) || (alphaNumericAUC(c) != 0)) continue;
(else) s = StringParse.setChar(s, chr, ' ');
- Code:
- Exact Method Body:
char[] cArr = s.toCharArray();

// This loop converts all non-AlphaNumeric unicode to a space
for (int i = 0; i < cArr.length; i++)
{
    char c = cArr[i];

    if (isAlphaNumeric(c) || (alphaNumericAUC(c) != 0)) continue;

    // Check for things like 5.0 or 1,120,987 - SPECIFICALLY Comma's and Period's
    // situated directly between 2 numbers.
    if (    ((c == '.') || (c == ','))
        &&  (((i-1) == -1) || isNumber(cArr[i-1]))
        &&  (((i+1) == s.length()) || isNumber(cArr[i+1]))
    )
        continue;

    cArr[i] = ' ';
}

return new String(cArr);
-
GTPPEIndexOf
public static int GTPPEIndexOf(java.lang.String s, char c)
GTPPE: Google Translate Punctuation Pronunciation Equivalent. This searches through a String to find the location of the "equivalent punctuation mark".
- Parameters:
s
- The input String, expected to be the result of a GCS TS query. This function is totally useless for any Pronunciation String that hasn't been obtained from GCS TS.
NOTE: The input String is intended to be in "PinYin" (罗马拼音)
c
- The original punctuation character to look for. Generally, this is used to search for higher-level UTF-8 chars that have been "down-converted" by GCS TS.
- Returns:
- The indexOf() of the equivalent character in the input String. The actual character is not looked for, BUT RATHER, the Google Cloud Server Translation Services equivalent character. Specifically, GCS TS has a "substitute punctuation" for many higher-level UTF-8 and UniCode chars. There are 5 different versions of a quote...
- Code:
- Exact Method Body:
int cc = (int) c;

// if (c == '∶') return s.indexOf(c);
if (cc == 0x2236) return s.indexOf(c);
// if (c == '：') return s.indexOf(':');
if (cc == 0xFF1A) return s.indexOf(':'); // (0x003A);
// if (c == ':') return s.indexOf(c); // Natural colon
if (cc == 0x003A) return s.indexOf(c);

// commas
// if (c == '、') return s.indexOf(',');
if (cc == 0x3001) return s.indexOf(','); // (0x002C);
// if (c == '，') return s.indexOf(',');
if (cc == 0xFF0C) return s.indexOf(','); // (0x002C);
// if (c == ',') return s.indexOf(c); // natural comma
if (cc == 0x002C) return s.indexOf(c);

// periods
// if (c == '。') return s.indexOf('.');
if (cc == 0x3002) return s.indexOf('.'); // (0x002E);
// if (c == '○') return s.indexOf(c);
if (cc == 0x25CB) return s.indexOf(c);
// if (c == '●') return s.indexOf(c);
if (cc == 0x25CF) return s.indexOf(c);
// if (c == '．') return s.indexOf('.');
if (cc == 0xFF0E) return s.indexOf('.'); // (0x002E);
// if (c == '.') return s.indexOf(c); // natural period
if (cc == 0x002E) return s.indexOf(c);

// Exclamation & Question
// if (c == '?') return s.indexOf(c); // natural question-mark
if (cc == 0x003F) return s.indexOf(c);
// if (c == '？') return s.indexOf('?');
if (cc == 0xFF1F) return s.indexOf('?'); // (0x003F);
// if (c == '！') return s.indexOf('!');
if (cc == 0xFF01) return s.indexOf('!'); // (0x0021);
// if (c == '!') return s.indexOf(c); // natural exclamation
if (cc == 0x0021) return s.indexOf(c);

// single-quotes
// if (c == '‘') return s.indexOf(c);
if (cc == 0x2018) return s.indexOf(c);
// if (c == '’') return s.indexOf(c);
if (cc == 0x2019) return s.indexOf(c);
// if (c == '′') return s.indexOf(c);
if (cc == 0x2032) return s.indexOf(c);
// if (c == '＇') return s.indexOf('\'');
if (cc == 0xFF07) return s.indexOf('\''); // (0x0027);
// if (c == '｀') return s.indexOf('`');
if (cc == 0xFF40) return s.indexOf('`'); // (0x0060);
// if (c == '\'') return s.indexOf(c); // natural single-quotes
if (cc == 0x0027) return s.indexOf(c);

// NOT DETECTED RIGHT NOW..
// if (c == '《') return s.indexOf('“');
if (cc == 0x300A) return s.indexOf(CONSTSpecialQuoteLeft);
// if (c == '》') return s.indexOf('”');
if (cc == 0x300B) return s.indexOf(CONSTSpecialQuoteRight);

// double-quotes
// if (c == '¨') return s.indexOf(c);
if (cc == 0x00A8) return s.indexOf(c);
// if (c == '〃') return s.indexOf(c);
if (cc == 0x3003) return s.indexOf(c);
// if (c == '“') return s.indexOf(c);
if (cc == 0x201C) return s.indexOf(c);
// if (c == '”') return s.indexOf(c);
if (cc == 0x201D) return s.indexOf(c);
// if (c == '″') return s.indexOf(c);
if (cc == 0x2033) return s.indexOf(c);
// if (c == '＂') return s.indexOf('\"');
if (cc == 0xFF02) return s.indexOf('\"'); // (0x0022);
// if (c == '\"') return s.indexOf(c); // natural double quotes
if (cc == 0x0022) return s.indexOf(c);

// Brackets
// if (c == '[') return s.indexOf(c);
if (cc == 0x005B) return s.indexOf(c);
// if (c == ']') return s.indexOf(c);
if (cc == 0x005D) return s.indexOf(c);
// if (c == '［') return s.indexOf('[');
if (cc == 0xFF3B) return s.indexOf('['); // (0x005B);
// if (c == '］') return s.indexOf(']');
if (cc == 0xFF3D) return s.indexOf(']'); // (0x005D);
// if (c == '【') return s.indexOf('[');
if (cc == 0x3010) return s.indexOf('['); // (0x005B);
// if (c == '】') return s.indexOf(']');
if (cc == 0x3011) return s.indexOf(']'); // (0x005D);
// if (c == '〖') return s.indexOf(c);
if (cc == 0x3016) return s.indexOf(c);
// if (c == '〗') return s.indexOf(c);
if (cc == 0x3017) return s.indexOf(c);
// if (c == '『') return s.indexOf('“');
if (cc == 0x300E) return s.indexOf(CONSTSpecialQuoteLeft);
// if (c == '』') return s.indexOf('”');
if (cc == 0x300F) return s.indexOf(CONSTSpecialQuoteRight);
// if (c == '「') return s.indexOf('`');
if (cc == 0x300C) return s.indexOf('`'); // (0x0060);
// if (c == '」') return s.indexOf('\'');
if (cc == 0x300D) return s.indexOf('\''); // (0x0027);

// Parenthesis
// if (c == '(') return s.indexOf(c);
if (cc == 0x0028) return s.indexOf(c);
// if (c == ')') return s.indexOf(c);
if (cc == 0x0029) return s.indexOf(c);
// if (c == '（') return s.indexOf('(');
if (cc == 0xFF08) return s.indexOf('('); // (0x0028);
// if (c == '）') return s.indexOf(')');
if (cc == 0xFF09) return s.indexOf(')'); // (0x0029);
// if (c == '〔') return s.indexOf(c);
if (cc == 0x3014) return s.indexOf(c);
// if (c == '〕') return s.indexOf(c);
if (cc == 0x3015) return s.indexOf(c);

System.out.println("character not found: \'" + c + "\'\nZH.GTPPEIndexOf(String s, char c)");
System.exit(0);
return 0;
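The long chain of comparisons above amounts to a lookup from an original punctuation mark to its substitute, followed by an indexOf() call. A table-driven version of that idea might look like the sketch below. It is not the library's method: `PunctuationLookupSketch` and `equivalentIndexOf` are hypothetical names, the map holds only a subset of the substitutions listed above, and unknown characters are searched for as-is (giving -1 when absent) rather than ending the program.

```java
import java.util.HashMap;
import java.util.Map;

final class PunctuationLookupSketch
{
    // A subset of the substitutions listed in the method body above.
    private static final Map<Character, Character> EQUIVALENT = new HashMap<>();

    static
    {
        EQUIVALENT.put('\uFF1A', ':');   // full-width colon
        EQUIVALENT.put('\u3001', ',');   // ideographic comma
        EQUIVALENT.put('\uFF0C', ',');   // full-width comma
        EQUIVALENT.put('\u3002', '.');   // ideographic full stop
        EQUIVALENT.put('\uFF1F', '?');   // full-width question mark
        EQUIVALENT.put('\uFF01', '!');   // full-width exclamation mark
        EQUIVALENT.put('\u3010', '[');   // left black lenticular bracket
        EQUIVALENT.put('\u3011', ']');   // right black lenticular bracket
        EQUIVALENT.put('\uFF08', '(');   // full-width left parenthesis
        EQUIVALENT.put('\uFF09', ')');   // full-width right parenthesis
    }

    // Looks for the substitute of 'c' when one is known, else for 'c' itself.
    static int equivalentIndexOf(String s, char c)
    {
        Character mapped = EQUIVALENT.get(c);
        return s.indexOf((mapped != null) ? mapped : c);
    }

    public static void main(String[] args)
    {
        String pinyin = "ni hao, shijie.";

        System.out.println(equivalentIndexOf(pinyin, '\uFF0C'));   // prints 6
        System.out.println(equivalentIndexOf(pinyin, '\u3002'));   // prints 14
    }
}
```

Keeping the pairs in a map makes it easier to add or audit substitutions, at the cost of departing from the explicit, commented listing used in the method body above.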
-
-