Package Torello.Languages
Class ES
- java.lang.Object
-
- Torello.Languages.ES
-
public class ES extends java.lang.Object
Some simple String utilities for helping parse Spanish (Español) Strings.
This class provides some simple helper routines for working with Spanish-language special characters. It deals particularly with accented vowels.
Source Code: Torello/Languages/ES.java
File Size: 21,627 Bytes; Line Count: 544 '\n' Characters Found
-
-
Method Summary
All Methods / Static Methods / Concrete Methods
Modifier and Type, Method, Description
static String convertHTML_TO_UTF8(String s)
This function is somewhat redundant, as a complete HTML-Character Escape-Sequence class is included in the Torello.HTML package.
static char getAccentedVowel(char vowel, int flags)
This is intended to produce an accented vowel 'on request' from the method invocation.
static boolean isLanguageChar(char c)
Checks if this character could be a Spanish Language Character.
static boolean isSpanishVerbInfinitive(String s)
Identifies Spanish-language infinitive-form verbs.
static boolean onlyLanguageChars(String s)
Checks whether a String consists entirely of Spanish-Language characters.
static String removeWords(String s)
Removes from the input String every occurrence of each word that is present in the "removeList" Vector<String>.
static void setRemoveWordsArr(String[] wordList)
Stores a list of "words" to be removed from certain texts/articles.
static char toLowerCaseSpanish(char c)
Produces a lower-case Spanish character if and only if the input parameter is an upper-case Spanish character.
static String toLowerCaseSpanish(String s)
Cycles through an input String and converts all upper-case letters, including ones with accent marks, tildes, and umlauts, returning a String in which all characters are lower-case and punctuation is preserved.
static char toNonAccented(char c, boolean preserveCase)
Converts Spanish accented characters into a lower-case, non-accented equivalent.
static String toNonAccented(String s, boolean preserveCase)
Removes Spanish accent marks from all characters in a String.
static char toUpperCaseSpanish(char c)
Produces an upper-case Spanish character if and only if the input parameter is a lower-case Spanish character.
static String toUpperCaseSpanish(String s)
Cycles through an input String and converts all lower-case letters, including ones with accent marks, tildes, and umlauts, returning a String in which all characters are upper-case and punctuation is preserved.
-
-
-
Field Detail
-
GRAVE
public static final int GRAVE
GRAVE and ACUTE share the first bit of the flags mask: if that bit is '1', the accent is GRAVE; if that bit is '0', the accent is ACUTE.
- See Also:
- Constant Field Values
- Code:
- Exact Field Declaration Expression:
public static final int GRAVE = 0b0001;
-
UPPERCASE
public static final int UPPERCASE
UPPER-CASE and LOWER-CASE share the second bit of the flags mask: if that bit is '1', the output is UPPER-CASE; if that bit is '0', the output is LOWER-CASE.
- See Also:
- Constant Field Values
- Code:
- Exact Field Declaration Expression:
public static final int UPPERCASE = 0b0010;
-
-
Method Detail
-
getAccentedVowel
public static char getAccentedVowel(char vowel, int flags)
This is intended to produce an accented vowel 'on request' from the method invocation. The complete list of characters that may be returned by this function is given below.
Upper, Grave    Upper, Acute    Lower, Grave    Lower, Acute
À (192)         Á (193)         à (224)         á (225)
È (200)         É (201)         è (232)         é (233)
Ì (204)         Í (205)         ì (236)         í (237)
Ò (210)         Ó (211)         ò (242)         ó (243)
Ù (217)         Ú (218)         ù (249)         ú (250)
- Parameters:
vowel
- Any vowel: [A, E, I, O, U] or [a, e, i, o, u]
If 'vowel' is not one of these 10 choices, this method just returns (char) 0.
flags
- The following values can be OR'd together into a bit mask: ES.GRAVE and ES.UPPERCASE
In total, there are 4 possible versions: Upper-Case/Lower-Case output, and Acute/Grave output.
- If ES.GRAVE is not set (binary-bit 0), then an acute accented vowel is returned (acute is "the default").
- If ES.UPPERCASE is not set (binary-bit 1), then a lower-case vowel is returned (lower-case is "the default").
- Returns:
- With correct input: one of the twenty accented vowels listed above. Otherwise, (char) 0 is returned.
- Code:
- Exact Method Body:
int i = 0;

if      ((vowel == 'a') || (vowel == 'A')) i = 192;
else if ((vowel == 'e') || (vowel == 'E')) i = 200;
else if ((vowel == 'i') || (vowel == 'I')) i = 204;
else if ((vowel == 'o') || (vowel == 'O')) i = 210;
else if ((vowel == 'u') || (vowel == 'U')) i = 217;
else return (char) 0;

// À (192) È (200) Ì (204) Ò (210) Ù (217)
if (((flags & UPPERCASE) > 0) && ((flags & GRAVE) > 0)) return (char) (i + 0);

// Á (193) É (201) Í (205) Ó (211) Ú (218)
else if ((flags & UPPERCASE) > 0) return (char) (i + 1);

// à (224) è (232) ì (236) ò (242) ù (249)
else if ((flags & GRAVE) > 0) return (char) (i + 32);

// á (225) é (233) í (237) ó (243) ú (250)
else return (char) (i + 33);
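The flag handling above can be read as "base code point plus two independent offsets". The following is a minimal stand-alone sketch of that reading, not the library's code: the class name AccentFlagsSketch and the method name accentedVowel are hypothetical, and the two constants simply copy the documented bit values.

```java
final class AccentFlagsSketch {
    // Same bit values as the documented GRAVE and UPPERCASE fields (copied here as an assumption).
    static final int GRAVE = 0b0001;
    static final int UPPERCASE = 0b0010;

    /** Hypothetical restatement of the lookup: a base code point plus two offsets. */
    static char accentedVowel(char vowel, int flags) {
        int base; // code point of the upper-case, grave-accented vowel
        switch (vowel) {
            case 'a': case 'A': base = 192; break;
            case 'e': case 'E': base = 200; break;
            case 'i': case 'I': base = 204; break;
            case 'o': case 'O': base = 210; break;
            case 'u': case 'U': base = 217; break;
            default: return (char) 0; // not one of the ten accepted vowels
        }
        if ((flags & GRAVE) == 0) base += 1;      // acute sits one code point above grave
        if ((flags & UPPERCASE) == 0) base += 32; // lower case sits 32 code points above upper case
        return (char) base;
    }

    public static void main(String[] args) {
        System.out.println((int) accentedVowel('e', 0));                 // 233
        System.out.println((int) accentedVowel('e', GRAVE | UPPERCASE)); // 200
    }
}
```

Passing 0 for the flags selects both defaults (lower-case, acute), which is why the first call yields code point 233.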
-
toNonAccented
public static char toNonAccented(char c, boolean preserveCase)
Converts Spanish accented characters into their non-accented equivalents. By default the result is also lower-cased, and regular upper-case characters are down-cased as well. If specifically requested, case can be preserved.
A (65) ... Z (90) ⇒ a .. z (or unchanged, if preserveCase is TRUE)
À (192), Á (193), à (224), á (225) ⇒ A or a
È (200), É (201), è (232), é (233) ⇒ E or e
Ì (204), Í (205), ì (236), í (237) ⇒ I or i
Ò (210), Ó (211), ò (242), ó (243) ⇒ O or o
Ù (217), Ú (218), ù (249), ú (250) ⇒ U or u
Ñ (209), ñ (241) ⇒ N or n
Ü (220), ü (252) ⇒ U or u
Ý (221), ý (253) ⇒ Y or y
- Parameters:
c
- Any ASCII/Unicode character
preserveCase
- If this is TRUE, then capital letters (accented or not) remain capitalized. If this is FALSE, then all letters are converted to lowercase.
- Returns:
- If this character contained an accent, it will be removed. It will also be in lower-case form, unless preserveCase is TRUE.
- Code:
- Exact Method Body:
if ((c == 224) || (c == 225)) return 'a';
if ((c == 232) || (c == 233)) return 'e';
if ((c == 236) || (c == 237)) return 'i';
if ((c == 242) || (c == 243)) return 'o';
if ((c == 249) || (c == 250)) return 'u';
if (c == 241) return 'n';
if (c == 252) return 'u';
if (c == 253) return 'y';

if ((c == 192) || (c == 193)) return (preserveCase ? 'A' : 'a');
if ((c == 200) || (c == 201)) return (preserveCase ? 'E' : 'e');
if ((c == 204) || (c == 205)) return (preserveCase ? 'I' : 'i');
if ((c == 210) || (c == 211)) return (preserveCase ? 'O' : 'o');
if ((c == 217) || (c == 218)) return (preserveCase ? 'U' : 'u');
if (c == 209) return (preserveCase ? 'N' : 'n');
if (c == 220) return (preserveCase ? 'U' : 'u');
if (c == 221) return (preserveCase ? 'Y' : 'y');

if ((c >= 'A') && (c <= 'Z')) return (char) (preserveCase ? c : (c - 'A' + 'a'));

return c;
-
toNonAccented
public static java.lang.String toNonAccented(java.lang.String s, boolean preserveCase)
Removes Spanish accent marks from all characters in a String.
- Returns:
- a new String, one where toNonAccented(s.charAt(i), preserveCase) has been called for each character in the String. This is just a small for-loop over a String.
- See Also:
toNonAccented(char, boolean)
- Code:
- Exact Method Body:
StringBuilder sb = new StringBuilder();
int len = s.length();

for (int i=0; i < len; i++) sb.append(toNonAccented(s.charAt(i), preserveCase));

return sb.toString();
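For comparison, accent stripping can also be done with the JDK's own Unicode normalizer rather than a hand-written table. This is a different technique from the method above, shown only as a sketch: the class and method names are hypothetical, and it strips combining marks from every letter, not just the Spanish characters listed in the table.

```java
import java.text.Normalizer;
import java.util.Locale;

final class StripAccentsSketch {
    /** Hypothetical helper: decompose to NFD, then drop all combining marks. */
    static String stripAccents(String s, boolean preserveCase) {
        // NFD splits an accented letter into a base letter followed by its combining mark(s)
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        // \p{M} matches any Unicode mark, so only the base letters remain
        String stripped = decomposed.replaceAll("\\p{M}", "");
        return preserveCase ? stripped : stripped.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // "\u00D1and\u00FA" spells N-tilde, a, n, d, u-acute
        System.out.println(stripAccents("\u00D1and\u00FA", true));  // Nandu
        System.out.println(stripAccents("\u00D1and\u00FA", false)); // nandu
    }
}
```

The table-driven version has the advantage of touching only the listed Spanish characters, while the normalizer-based version needs no table at all.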
-
toLowerCaseSpanish
public static char toLowerCaseSpanish(char c)
Produces a lower-case Spanish character if and only if the input parameter is an upper-case Spanish character. This is similar to the standard Character.toLowerCase(char), but it converts only 'A' .. 'Z' and the Spanish vowels and consonants with:
- accent marks: À, Á, à, and á ... etc.
- umlauts: Ü and ü
- tildes: Ñ and ñ
NOTE: The grave accent mark is not as prevalent now as it was in the time of "Don Quijote de la Mancha"; however, it is included here, just in case. Mostly the acute accent mark (from top-right corner to lower-left corner) is used in newspapers around here (Dallas, Texas).
- Parameters:
c
- Any ASCII or Unicode char
- Returns:
- Uppercase letters 'A' .. 'Z' are converted to 'a' .. 'z'
AND:
À (192), Á (193) ⇒ à (224), á (225)
È (200), É (201) ⇒ è (232), é (233)
Ì (204), Í (205) ⇒ ì (236), í (237)
Ò (210), Ó (211) ⇒ ò (242), ó (243)
Ù (217), Ú (218) ⇒ ù (249), ú (250)
Ñ (209) ⇒ ñ (241)
Ý (221) ⇒ ý (253)
Ü (220) ⇒ ü (252)
- See Also:
toUpperCaseSpanish(char), toLowerCaseSpanish(String)
- Code:
- Exact Method Body:
if ((c >= 'A') && (c <= 'Z')) return (char) (c + 'a' - 'A');

else if (   (c == 192) || (c == 193) || (c == 200) || (c == 201) || (c == 204)
         || (c == 205) || (c == 210) || (c == 211) || (c == 217) || (c == 218)
         || (c == 209) || (c == 220) || (c == 221)
)
    return (char) (c + 32);

return c;
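The "+ 32" in the method body relies on the layout of the Latin-1 block, where each of these upper-case letters sits exactly 32 code points below its lower-case form. The sketch below (hypothetical class and method names) checks that offset against the JDK's Character.toLowerCase(char) for the 13 code points listed above.

```java
final class Latin1CaseSketch {
    // The 13 upper-case code points handled by the documented method body.
    static final int[] UPPER = {192, 193, 200, 201, 204, 205, 210, 211, 217, 218, 209, 220, 221};

    /** Returns true if "code point + 32" agrees with the JDK for every listed character. */
    static boolean offsetMatchesJdk() {
        for (int cp : UPPER) {
            char upper = (char) cp;
            char viaOffset = (char) (upper + 32);
            if (viaOffset != Character.toLowerCase(upper)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(offsetMatchesJdk());
    }
}
```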
-
toLowerCaseSpanish
public static java.lang.String toLowerCaseSpanish(java.lang.String s)
This cycles through an input-String parameter and converts all letters that are upper-case, including ones with accent marks, tildes, and umlauts. It returns a String in which all characters are lower-case, with punctuation preserved.
- Returns:
- a new String in which toLowerCaseSpanish(char) has been invoked on each character.
- See Also:
toLowerCaseSpanish(char)
- Code:
- Exact Method Body:
StringBuilder ret = new StringBuilder();

for (int i=0; i < s.length(); i++) ret.append(toLowerCaseSpanish(s.charAt(i)));

return ret.toString();
-
toUpperCaseSpanish
public static char toUpperCaseSpanish(char c)
Produces an upper-case Spanish character if and only if the input parameter is a lower-case Spanish character. See toLowerCaseSpanish(char) for more notes!
- Parameters:
c
- Any ASCII or Unicode char
- Returns:
- Lowercase letters 'a' .. 'z' are converted to 'A' .. 'Z'
AND:
à (224), á (225) ⇒ À (192), Á (193)
è (232), é (233) ⇒ È (200), É (201)
ì (236), í (237) ⇒ Ì (204), Í (205)
ò (242), ó (243) ⇒ Ò (210), Ó (211)
ù (249), ú (250) ⇒ Ù (217), Ú (218)
ñ (241) ⇒ Ñ (209)
ý (253) ⇒ Ý (221)
ü (252) ⇒ Ü (220)
- See Also:
toLowerCaseSpanish(char), toUpperCaseSpanish(String)
- Code:
- Exact Method Body:
if ((c >= 'a') && (c <= 'z')) return (char) (c + 'A' - 'a');

else if (   (c == 224) || (c == 225) || (c == 232) || (c == 233) || (c == 236)
         || (c == 237) || (c == 242) || (c == 243) || (c == 249) || (c == 250)
         || (c == 241) || (c == 253) || (c == 252)
)
    return (char) (c - 32);

return c;
-
toUpperCaseSpanish
public static java.lang.String toUpperCaseSpanish(java.lang.String s)
This cycles through an input-String parameter and converts all letters that are lower-case, including ones with accent marks, tildes, and umlauts. It returns a String in which all characters are upper-case, with punctuation preserved.
- Returns:
- a new String in which toUpperCaseSpanish(char) has been invoked on each character.
- See Also:
toUpperCaseSpanish(char)
- Code:
- Exact Method Body:
StringBuilder ret = new StringBuilder();

// Upper-case each character in turn
for (int i=0; i < s.length(); i++) ret.append(toUpperCaseSpanish(s.charAt(i)));

return ret.toString();
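The documented behavior (every lower-case Spanish letter is upper-cased, everything else passes through) can be sketched in a self-contained form as follows. The class UpperCaseLoopSketch and its toUpper methods are hypothetical stand-ins, written only from the conversion table in toUpperCaseSpanish(char).

```java
final class UpperCaseLoopSketch {
    /** Upper-cases 'a'..'z' and the 13 listed Latin-1 letters; other characters pass through. */
    static char toUpper(char c) {
        if (c >= 'a' && c <= 'z') return (char) (c - 32);
        switch (c) {
            case 224: case 225: case 232: case 233: case 236: case 237:
            case 242: case 243: case 249: case 250: case 241: case 252: case 253:
                return (char) (c - 32);
            default:
                return c; // punctuation, digits, and already upper-case letters
        }
    }

    /** Applies the per-character conversion to a whole String. */
    static String toUpper(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) sb.append(toUpper(s.charAt(i)));
        return sb.toString();
    }

    public static void main(String[] args) {
        // Input spells: ni(n-tilde)o, (inverted ?)qu(e-acute)?
        System.out.println(toUpper("ni\u00F1o, \u00BFqu\u00E9?"));
    }
}
```

Punctuation such as the inverted question mark (191) is outside both ranges, so it is copied unchanged.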
-
isLanguageChar
public static boolean isLanguageChar(char c)
Checks if this character could be a Spanish Language Character.
- Parameters:
c
- Any ASCII or Unicode character
- Returns:
- TRUE: If and only if 'c' is one of the following char-sets:
- a ... z
- A ... Z
- Á (193), É (201), Í (205), Ó (211), Ú (218), Ý (221), Ü (220), Ñ (209)
- á (225), é (233), í (237), ó (243), ú (250), ý (253), ü (252), ñ (241)
and FALSE otherwise.
- Code:
- Exact Method Body:
if ((c >= 'a') && (c <= 'z')) return true;
if ((c >= 'A') && (c <= 'Z')) return true;

// Á 193, É 201, Í 205, Ó 211, Ú 218, Ý 221, Ü 220, Ñ 209
if (   (c == 193) || (c == 201) || (c == 205) || (c == 211) || (c == 218)
    || (c == 221) || (c == 220) || (c == 209))
    return true;

// á 225, é 233, í 237, ó 243, ú 250, ý 253, ü 252, ñ 241
if (   (c == 225) || (c == 233) || (c == 237) || (c == 243) || (c == 250)
    || (c == 253) || (c == 252) || (c == 241))
    return true;

return false;
-
onlyLanguageChars
public static boolean onlyLanguageChars(java.lang.String s)
Checks whether a String consists entirely of Spanish-Language characters. Utilizes isLanguageChar(char).
- Parameters:
s
- Any String consisting of ASCII & Unicode characters
- Returns:
- TRUE only if isLanguageChar(s.charAt(i)) returns TRUE for every integer i, and FALSE otherwise.
- See Also:
isLanguageChar(char)
- Code:
- Exact Method Body:
for (int i=0; i < s.length(); i++)
    if (! isLanguageChar(s.charAt(i)))
        return false;

return true;
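The two checks above can be sketched together in a compact, self-contained form. This is an illustrative restatement rather than the library source: the class name LanguageCharsSketch is hypothetical, and the accented characters are written as Unicode escapes matching the code points listed for isLanguageChar(char).

```java
final class LanguageCharsSketch {
    // Upper case: A-acute, E-acute, I-acute, O-acute, U-acute, Y-acute, U-umlaut, N-tilde,
    // followed by the same eight letters in lower case.
    private static final String ACCENTED =
        "\u00C1\u00C9\u00CD\u00D3\u00DA\u00DD\u00DC\u00D1"
      + "\u00E1\u00E9\u00ED\u00F3\u00FA\u00FD\u00FC\u00F1";

    static boolean isLanguageChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || ACCENTED.indexOf(c) >= 0;
    }

    static boolean onlyLanguageChars(String s) {
        // allMatch returns true for an empty String, just like the loop version
        return s.chars().allMatch(ch -> isLanguageChar((char) ch));
    }

    public static void main(String[] args) {
        System.out.println(onlyLanguageChars("ni\u00F1o"));   // true
        System.out.println(onlyLanguageChars("hola mundo"));  // false: the space is not a letter
    }
}
```

Note that spaces, digits, punctuation, and the grave-accented vowels all fail the test, so a multi-word phrase is rejected.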
-
isSpanishVerbInfinitive
public static boolean isSpanishVerbInfinitive(java.lang.String s)
Identifies Spanish-language infinitive-form verbs.
- Parameters:
s
- Any String consisting of ASCII & Unicode characters
- Returns:
- TRUE if and only if:
- input-parameter 's', after lower-casing, ends with: ar, er, ir, arse, erse, irse, ír, írse
- 's' passes the onlyLanguageChars(String) boolean test
FALSE otherwise
- See Also:
onlyLanguageChars(String)
- Code:
- Exact Method Body:
s = toLowerCaseSpanish(s);

if (onlyLanguageChars(s))
    if (   s.endsWith("ar")   || s.endsWith("er")   || s.endsWith("ir")
        || s.endsWith("arse") || s.endsWith("erse") || s.endsWith("irse")
        || s.endsWith("ír")   || s.endsWith("írse"))
        return true;

return false;
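The same two conditions (only Spanish letters, and one of the eight listed endings) can be expressed as a single regular expression. The sketch below is an alternative formulation, not the library's implementation: InfinitiveSketch and looksLikeInfinitive are hypothetical names, and it lower-cases with String.toLowerCase(Locale.ROOT) instead of toLowerCaseSpanish, which may differ for unusual non-Spanish characters.

```java
import java.util.Locale;
import java.util.regex.Pattern;

final class InfinitiveSketch {
    // Any run of lower-case Spanish letters, then ar / er / ir / (i-acute)r, optionally followed by "se".
    private static final Pattern INFINITIVE = Pattern.compile(
        "[a-z\u00E1\u00E9\u00ED\u00F3\u00FA\u00FD\u00FC\u00F1]*(?:[aei\u00ED]r)(?:se)?");

    static boolean looksLikeInfinitive(String s) {
        // matches() requires the whole String to fit the pattern
        return INFINITIVE.matcher(s.toLowerCase(Locale.ROOT)).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeInfinitive("Lavarse")); // true
        System.out.println(looksLikeInfinitive("casa"));    // false
    }
}
```

Like the original, this is a purely orthographic test: any word that happens to end in one of these suffixes passes, whether or not it is actually a verb.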
-
convertHTML_TO_UTF8
public static java.lang.String convertHTML_TO_UTF8(java.lang.String s)
This function is somewhat redundant, as a complete HTML-Character Escape-Sequence class is included in the Torello.HTML package. Links to those methods are provided at the end of this comment. This method was written much earlier and functions well, but it can only convert the HTML Escape-Sequences that are used in Spanish, rather than all HTML-Character Escape-Sequences. Here is the complete list:
&aacute; ⇒ á
&eacute; ⇒ é
&iacute; ⇒ í
&oacute; ⇒ ó
&uacute; ⇒ ú
&Aacute; ⇒ Á
&Eacute; ⇒ É
&Iacute; ⇒ Í
&Oacute; ⇒ Ó
&Uacute; ⇒ Ú
&ntilde; ⇒ ñ
&laquo; ⇒ «
&raquo; ⇒ »
&mdash; ⇒ -
&uuml; ⇒ ü
&iuml; ⇒ ï
&iexcl; ⇒ ¡
&iquest; ⇒ ¿
&quot; ⇒ "
- Parameters:
s
- Any ASCII/Unicode String that may contain Spanish-Language HTML-Escaped characters.
- Returns:
- A String in which the HTML escape-sequences listed above have been converted to their actual character equivalents.
- See Also:
Escape.escHTMLToChar(String), Escape.htmlEsc(char), StrReplace.r(String, String[], char[])
- Code:
- Exact Method Body:
return StrReplace.r(s, ESC_STRS, REPL_CHARS);
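The contents of ESC_STRS and REPL_CHARS are not shown on this page, so the following is only a sketch of the parallel-array replacement idea using plain String.replace. The arrays here are assumptions: they pair the standard HTML named entities with the characters listed in the description, and the class and method names are hypothetical.

```java
final class SpanishEntitiesSketch {
    // Assumed escape strings: the standard HTML named entities for the listed characters.
    private static final String[] ESCAPES = {
        "&aacute;", "&eacute;", "&iacute;", "&oacute;", "&uacute;",
        "&Aacute;", "&Eacute;", "&Iacute;", "&Oacute;", "&Uacute;",
        "&ntilde;", "&laquo;", "&raquo;", "&mdash;",
        "&uuml;", "&iuml;", "&iexcl;", "&iquest;", "&quot;"
    };

    // Replacement characters, index-aligned with ESCAPES. Note &mdash; maps to a plain hyphen.
    private static final char[] CHARS = {
        '\u00E1', '\u00E9', '\u00ED', '\u00F3', '\u00FA',
        '\u00C1', '\u00C9', '\u00CD', '\u00D3', '\u00DA',
        '\u00F1', '\u00AB', '\u00BB', '-',
        '\u00FC', '\u00EF', '\u00A1', '\u00BF', '"'
    };

    static String unescape(String s) {
        for (int i = 0; i < ESCAPES.length; i++)
            s = s.replace(ESCAPES[i], String.valueOf(CHARS[i]));
        return s;
    }

    public static void main(String[] args) {
        System.out.println(unescape("a &mdash; b")); // a - b
    }
}
```

Each pass scans the whole String once, which is simple but not the fastest approach; a single-pass replacer such as the StrReplace.r call above avoids the repeated scans.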
-
setRemoveWordsArr
public static void setRemoveWordsArr(java.lang.String[] wordList)
Stores a list of "words" to be removed from certain texts/articles. This program currently uses it to remove certain extremely common words, so that they are not repeatedly searched for in the dictionary. It is kind of a hack.
- Parameters:
wordList
- An array of Strings. It is expected to be a list of words that may be removed from Spanish texts, but it can be any list of words. Every character of each word is checked with isLanguageChar(char), and an IllegalArgumentException is thrown if any character fails that check.
- Throws:
java.lang.IllegalArgumentException
- if the wordList parameter contains strings with invalid non-word characters.
- Code:
- Exact Method Body:
removeList = new Vector<String>();

for (int i=0; i < wordList.length; i++)
{
    String word = wordList[i];

    for (int j=0; j < word.length(); j++)
        if (! isLanguageChar(word.charAt(j)))
            throw new IllegalArgumentException(
                "Contains word:" + word + " which has invalid, non-word, language-characters");

    removeList.addElement(word);
}
-
removeWords
public static java.lang.String removeWords(java.lang.String s)
Removes from the input String every occurrence of each word that is present in the "removeList" Vector<String>. The text is lower-cased with toLowerCaseSpanish(String) before searching, and a word is removed only where it is not adjacent to another Spanish-language character.
- Parameters:
s
- A String of Spanish words.
- Returns:
- The same String with each instance of each word that is listed in the "removeList" Vector removed from the String.
- See Also:
setRemoveWordsArr(String[])
- Code:
- Exact Method Body:
Enumeration<String> e = removeList.elements();

while (e.hasMoreElements())
{
    // Search a lower-cased copy, but cut the matches out of the original 's'
    String lc   = toLowerCaseSpanish(s);
    String word = e.nextElement();
    int    pos  = 0;

    while ((pos = lc.indexOf(word, pos)) != -1)
    {
        int     startPos  = pos;
        int     endPos    = pos + word.length();
        boolean leftEnd   = (startPos == 0);
        boolean rightEnd  = (endPos == lc.length());
        char    leftChar  = leftEnd  ? 0 : lc.charAt(startPos - 1);
        char    rightChar = rightEnd ? 0 : lc.charAt(endPos);

        // Skip matches that are only part of a longer word
        if (isLanguageChar(leftChar))  { pos = endPos; continue; }
        if (isLanguageChar(rightChar)) { pos = endPos; continue; }

        // Remove one adjacent space together with the word
        boolean leftSpace  = (leftChar == ' ');
        boolean rightSpace = (rightChar == ' ');

        if      (leftSpace && rightSpace) startPos--;
        else if (leftSpace && rightEnd)   startPos--;
        else if (leftEnd && rightSpace)   endPos++;

        s  = (leftEnd ? "" : s.substring(0, startPos)) + (rightEnd ? "" : s.substring(endPos));
        lc = toLowerCaseSpanish(s);
    }
}

return s;
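Whole-word removal of this kind can also be sketched with regular-expression lookarounds. This is a different technique from the index-based loop above, and its whitespace handling is simpler (it collapses runs of spaces and trims the result), so the output can differ in edge cases. All names below are hypothetical, and the word list is passed as a parameter rather than stored in a static field.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class RemoveWordsSketch {
    // Letters treated as "part of a word": a-z, A-Z, and the 16 accented code points of isLanguageChar.
    private static final String LETTERS = "a-zA-Z"
        + "\u00C1\u00C9\u00CD\u00D3\u00DA\u00DD\u00DC\u00D1"
        + "\u00E1\u00E9\u00ED\u00F3\u00FA\u00FD\u00FC\u00F1";

    static String removeWords(String s, List<String> words) {
        for (String word : words) {
            // Match the word only when no letter precedes or follows it
            Pattern p = Pattern.compile(
                "(?<![" + LETTERS + "])" + Pattern.quote(word) + "(?![" + LETTERS + "])",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
            s = p.matcher(s).replaceAll("");
        }
        // Tidy up the spaces left behind by the removed words
        return s.replaceAll(" {2,}", " ").trim();
    }

    public static void main(String[] args) {
        List<String> stopWords = Arrays.asList("el", "de", "la");
        System.out.println(removeWords("El perro de la casa", stopWords)); // perro casa
    }
}
```

Because of the lookarounds, "el" is removed from "El perro" but left alone inside "elefante".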
-
-