java.lang.Object
- Torello.Languages.ES

```
public class ES
extends java.lang.Object
```
Some simple String utilities for helping parse Spanish (Español) Strings.

This class provides some simple helper routines for working with Spanish language special characters. It deals particularly with accented vowels.
File Size: 21,627 Bytes Line Count: 544 '\n' Characters Found

Field Summary

Fields
Modifier and Type	Field	Description
`static int`	`GRAVE`	GRAVE and ACUTE share the "first bit" (bit 0) of the flags mask; if that bit is '0', then the accent is ACUTE
`static int`	`UPPERCASE`	UPPER-CASE and LOWER-CASE share the "second bit" (bit 1) of the flags mask; if that bit is '0', then the case is LOWER-CASE

Method Summary

All Methods Static Methods Concrete Methods
Modifier and Type	Method	Description
`static String`	`convertHTML_TO_UTF8(String s)`	This function is somewhat redundant, as a complete HTML-Character Escape-Sequence class is included in the Torello.HTML package.
`static char`	`getAccentedVowel(char vowel, int flags)`	This is intended to produce an accented vowel 'on request' from the method invocation.
`static boolean`	`isLanguageChar(char c)`	Checks if this character could be a Spanish Language Character
`static boolean`	`isSpanishVerbInfinitive(String s)`	This is a function which identifies Spanish Language Infinitive Form Verbs.
`static boolean`	`onlyLanguageChars(String s)`	Checks that a `String` contains only Spanish-Language Characters.
`static String`	`removeWords(String s)`	This function references the words in the "removeList" and removes every occurrence of each word that is present in the "removeList" `Vector<String>`
`static void`	`setRemoveWordsArr(String[] wordList)`	This just stores a list of "words", and they are removed from certain texts/articles.
`static char`	`toLowerCaseSpanish(char c)`	Produces a lower-case Spanish Character - if and only if the input-parameter is an upper-case Spanish Character.
`static String`	`toLowerCaseSpanish(String s)`	This cycles through an input-String parameter, converts any/all letters that are upper-case (including ones with accent marks, tildes, and umlauts), and returns a `String` in which all characters are lower-case, but have their punctuation preserved.
`static char`	`toNonAccented(char c, boolean preserveCase)`	Converts a Spanish accented character into its non-accented equivalent, lower-cased unless preserveCase is TRUE.
`static String`	`toNonAccented(String s, boolean preserveCase)`	Removes Spanish-Accent Characters from all characters in a string.
`static char`	`toUpperCaseSpanish(char c)`	Produces an upper-case Spanish Character - if and only if the input-parameter is a lower-case Spanish Character.
`static String`	`toUpperCaseSpanish(String s)`	This cycles through an input-String parameter, converts any/all letters that are lower-case (including ones with accent marks, tildes, and umlauts), and returns a String in which all characters are upper-case, but have their punctuation preserved.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail
- GRAVE
  
```
public static final int GRAVE
```
  GRAVE and ACUTE share the "first bit" (bit 0) of the flags mask; if that bit is '0', then the accent is ACUTE
  See Also:
  
  Constant Field Values
  
  Code:
  
  Exact Field Declaration Expression:
  
  public static final int GRAVE = 0b0001;
- UPPERCASE
  
```
public static final int UPPERCASE
```
  UPPER-CASE and LOWER-CASE share the "second bit" (bit 1) of the flags mask; if that bit is '0', then the case is LOWER-CASE
  See Also:
  
  Constant Field Values
  
  Code:
  
  Exact Field Declaration Expression:
  
  public static final int UPPERCASE = 0b0010;

Method Detail

getAccentedVowel

public static char getAccentedVowel(char vowel,
                                    int flags)

This is intended to produce an accented vowel 'on request' from the method invocation. The complete list of characters that may be returned by this function is given below.

Upper, Grave	Upper, Acute	Lower, Grave	Lower, Acute
À (192)	Á (193)	à (224)	á (225)
È (200)	É (201)	è (232)	é (233)
Ì (204)	Í (205)	ì (236)	í (237)
Ò (210)	Ó (211)	ò (242)	ó (243)
Ù (217)	Ú (218)	ù (249)	ú (250)

Parameters:

vowel - Any vowel: [A, E, I, O, U] or [a, e, i, o, u]

If 'vowel' is not one of these 10 choices, this method returns (char) 0.

flags - The following values can be OR'd together (masked): ES.GRAVE and ES.UPPERCASE

In total, there are 4 possible versions: Upper-Case/Lower-Case output, and Acute/Grave output.

If ES.GRAVE is not set (binary-bit 0), then an "acute" accented vowel is returned (acute is "the default").
If ES.UPPERCASE is not set (binary-bit 1), then a lower-case vowel is returned (lower-case is "the default").

Returns:

With valid input: one of the twenty accented vowels listed above. Otherwise, (char) 0 is returned.

Code:

Exact Method Body:

 int i = 0;

 if      ((vowel == 'a') || (vowel == 'A')) i = 192;
 else if ((vowel == 'e') || (vowel == 'E')) i = 200;
 else if ((vowel == 'i') || (vowel == 'I')) i = 204;
 else if ((vowel == 'o') || (vowel == 'O')) i = 210;
 else if ((vowel == 'u') || (vowel == 'U')) i = 217;
 else return (char) 0;

 // À (192)È (200)Ì (204)Ò (210)Ù (217)
 if (    ((flags & UPPERCASE) > 0)
     &&  ((flags & GRAVE) > 0)
 )
     return (char) (i + 0);

 // Á (193)É (201)Í (205)Ó (211)Ú (218)
 else if ((flags & UPPERCASE) > 0) return (char) (i + 1);

 // à (224)è (232)ì (236)ò (242)ù (249)
 else if ((flags & GRAVE) > 0) return (char) (i + 32);

 // á (225)é (233)í (237)ó (243)ú (250)
 else return (char) (i + 33);
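
The flag handling above can be tried out with a small stand-alone sketch. The class name `AccentedVowelSketch` is hypothetical; the method mirrors the documented body rather than calling the library, and the two constants copy the documented field values.

```java
// Hypothetical stand-alone sketch; it mirrors the documented method body
// instead of calling Torello.Languages.ES.
class AccentedVowelSketch {
    static final int GRAVE = 0b0001;      // bit 0: grave instead of acute
    static final int UPPERCASE = 0b0010;  // bit 1: upper-case instead of lower-case

    static char getAccentedVowel(char vowel, int flags) {
        int base;  // code of the upper-case, grave form of the vowel
        switch (vowel) {
            case 'a': case 'A': base = 192; break;
            case 'e': case 'E': base = 200; break;
            case 'i': case 'I': base = 204; break;
            case 'o': case 'O': base = 210; break;
            case 'u': case 'U': base = 217; break;
            default: return (char) 0;  // not one of the ten accepted vowels
        }
        if ((flags & GRAVE) == 0) base += 1;       // acute sits one above grave
        if ((flags & UPPERCASE) == 0) base += 32;  // lower-case sits 32 above upper-case
        return (char) base;
    }

    public static void main(String[] args) {
        // Numeric codes of the four variants of 'e'.
        System.out.println((int) getAccentedVowel('e', 0));                  // 233
        System.out.println((int) getAccentedVowel('e', GRAVE));              // 232
        System.out.println((int) getAccentedVowel('e', UPPERCASE));          // 201
        System.out.println((int) getAccentedVowel('e', GRAVE | UPPERCASE));  // 200
    }
}
```

Passing `0` for the flags selects both defaults (lower-case, acute), which is the form most Spanish text needs.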

toNonAccented

public static char toNonAccented(char c,
                                 boolean preserveCase)

Converts a Spanish accented character into its non-accented equivalent. By default the result is also lower-cased, and the unaccented upper-case letters 'A' .. 'Z' are down-cased as well. If preserveCase is TRUE, upper-case letters keep their case.

A (65) ... Z (90)	⇒ a .. z
À (192), Á (193), à (224), á (225)	⇒ A or a
È (200), É (201), è (232), é (233)	⇒ E or e
Ì (204), Í (205), ì (236), í (237)	⇒ I or i
Ò (210), Ó (211), ò (242), ó (243)	⇒ O or o
Ù (217), Ú (218), ù (249), ú (250)	⇒ U or u
Ñ (209), ñ (241)	⇒ N or n
Ü (220), ü (252)	⇒ U or u
Ý (221), ý (253)	⇒ Y or y

Parameters:

c - Any ASCII/Unicode character

preserveCase - If this is TRUE, then capital letters (accented or not) remain capitalized. If this is FALSE, then all letters are converted to lower-case.

Returns:

If this character contained an accent, it will be removed. It will also be in lower-case form, unless preserveCase is TRUE.

Code:

Exact Method Body:

 if ((c == 224) || (c == 225))   return 'a';
 if ((c == 232) || (c == 233))   return 'e';
 if ((c == 236) || (c == 237))   return 'i';
 if ((c == 242) || (c == 243))   return 'o';
 if ((c == 249) || (c == 250))   return 'u';
 if (c == 241)                   return 'n';
 if (c == 252)                   return 'u';
 if (c == 253)                   return 'y';

 if ((c == 192) || (c == 193))   return (preserveCase ? 'A' : 'a');
 if ((c == 200) || (c == 201))   return (preserveCase ? 'E' : 'e');
 if ((c == 204) || (c == 205))   return (preserveCase ? 'I' : 'i');
 if ((c == 210) || (c == 211))   return (preserveCase ? 'O' : 'o');
 if ((c == 217) || (c == 218))   return (preserveCase ? 'U' : 'u');
 if (c == 209)                   return (preserveCase ? 'N' : 'n');
 if (c == 220)                   return (preserveCase ? 'U' : 'u');
 if (c == 221)                   return (preserveCase ? 'Y' : 'y');

 if ((c >= 'A') && (c <= 'Z'))   return (char) (preserveCase ? c : (c -'A' + 'a'));

 return c;

toNonAccented

public static java.lang.String toNonAccented(java.lang.String s,
                                             boolean preserveCase)

Removes Spanish-Accent Characters from all characters in a string.

Returns:

a new String, one where toNonAccented(s.charAt(i), preserveCase) has been called for each character in the String. This is just a small for-loop over a String.

See Also:

toNonAccented(char, boolean)

Code:

Exact Method Body:

 StringBuilder   sb  = new StringBuilder();
 int             len = s.length();

 for (int i=0; i < len; i++) sb.append(toNonAccented(s.charAt(i), preserveCase));

 return sb.toString();
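
As an illustration of the preserveCase switch, here is a minimal table-driven sketch. The class name `NonAccentedSketch` and the two lookup strings are hypothetical; the library itself uses the chain of comparisons shown in the method bodies above.

```java
// Hypothetical stand-alone sketch of the documented mapping, using a lookup
// table instead of the library's chain of if-statements.
class NonAccentedSketch {
    // The accented Latin-1 letters handled by the documented method (codes
    // 192 .. 253), and their unaccented replacements at the same index.
    static final String ACCENTED =
        "\u00C0\u00C1\u00C8\u00C9\u00CC\u00CD\u00D2\u00D3\u00D9\u00DA\u00D1\u00DC\u00DD"
      + "\u00E0\u00E1\u00E8\u00E9\u00EC\u00ED\u00F2\u00F3\u00F9\u00FA\u00F1\u00FC\u00FD";
    static final String PLAIN = "AAEEIIOOUUNUY" + "aaeeiioouunuy";

    static char toNonAccented(char c, boolean preserveCase) {
        int i = ACCENTED.indexOf(c);
        char plain = (i >= 0) ? PLAIN.charAt(i) : c;
        // Only 'A' .. 'Z' are down-cased, matching the documented behavior.
        if (!preserveCase && plain >= 'A' && plain <= 'Z')
            plain = (char) (plain - 'A' + 'a');
        return plain;
    }

    static String toNonAccented(String s, boolean preserveCase) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++)
            sb.append(toNonAccented(s.charAt(i), preserveCase));
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "\u00C9l ni\u00F1o";  // E-acute (201) and n-tilde (241)
        System.out.println(toNonAccented(input, true));   // El nino
        System.out.println(toNonAccented(input, false));  // el nino
    }
}
```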

toLowerCaseSpanish

public static char toLowerCaseSpanish(char c)

Produces a lower-case Spanish Character - if and only if the input-parameter is an upper-case Spanish Character. It behaves much like the standard Character.toLowerCase(char), but is limited to 'A' .. 'Z' plus the Spanish vowels and consonants with:

accent marks: À, Á, à, and á ... etc.
umlauts: Ü and ü
tildes: Ñ and ñ

NOTE: The 'acute' and 'grave' accent marks are not so prevalently used anymore as in the time of "Don Quijote de la Mancha" - however, they are included here, just in case. Mostly the 'acute' accent mark (from top-right-corner to the lower-left-corner) is used in newspapers around here (Dallas, Texas).

Parameters:

c - Any ASCII or Unicode char

Returns:

Uppercase letters 'A' .. 'Z' are converted to 'a' .. 'z'
AND:

À (192), Á (193)	⇒ à (224), á (225)
È (200), É (201)	⇒ è (232), é (233)
Ì (204), Í (205)	⇒ ì (236), í (237)
Ò (210), Ó (211)	⇒ ò (242), ó (243)
Ù (217), Ú (218)	⇒ ù (249), ú (250)
Ñ (209)	⇒ ñ (241)
Ý (221)	⇒ ý (253)
Ü (220)	⇒ ü (252)

Code:

Exact Method Body:

 if ((c >= 'A') && (c <= 'Z')) return (char) (c + 'a' - 'A');

 else if (
         (c == 192) || (c == 193) || (c == 200) || (c == 201)
     ||  (c == 204) || (c == 205) || (c == 210) || (c == 211)
     ||  (c == 217) || (c == 218) || (c == 209) || (c == 220)
     ||  (c == 221)
 )
     return (char) (c + 32);

 return c;
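
The single `c + 32` works because each accented upper-case letter in the table above sits exactly 32 codes below its lower-case form, the same distance as 'A' to 'a'. The following sketch (class and method names are hypothetical) checks that shortcut against the JDK's own `Character.toLowerCase(char)`.

```java
// Hypothetical check that the "+ 32" shortcut agrees with the JDK's case
// mapping for every upper-case code handled by the documented method.
class CaseOffsetSketch {
    static final int[] UPPER =
        {192, 193, 200, 201, 204, 205, 210, 211, 217, 218, 209, 220, 221};

    // True when Character.toLowerCase yields exactly (upper + 32).
    static boolean offsetHolds(int upper) {
        return Character.toLowerCase((char) upper) == (char) (upper + 32);
    }

    public static void main(String[] args) {
        for (int u : UPPER)
            System.out.println(u + " -> " + (u + 32) + ", agrees with JDK: " + offsetHolds(u));
    }
}
```

The offset is not valid for every code in that range, for example 215 (the multiplication sign) has no lower-case form, which is why the method lists the accepted codes explicitly.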

toLowerCaseSpanish

```
public static java.lang.String toLowerCaseSpanish(java.lang.String s)
```
This cycles through an input-String parameter, converts any/all letters that are upper-case (including ones with accent marks, tildes, and umlauts), and returns a String in which all characters are lower-case, but have their punctuation preserved.
Returns:

a new String in which toLowerCaseSpanish(char) has been invoked on each character.

See Also:

toLowerCaseSpanish(char)

Code:
Exact Method Body:

 StringBuilder ret = new StringBuilder();

 for (int i=0; i < s.length(); i++) ret.append(toLowerCaseSpanish(s.charAt(i)));

 return ret.toString();

toUpperCaseSpanish

public static char toUpperCaseSpanish(char c)

Produces an upper-case Spanish Character - if and only if the input-parameter is a lower-case Spanish Character. See toLowerCaseSpanish(char) for more notes!

Parameters:

c - Any ASCII or Unicode char

Returns:

Lowercase letters 'a' .. 'z' are converted to 'A' .. 'Z'

AND:

à (224), á (225)	⇒ À (192), Á (193)
è (232), é (233)	⇒ È (200), É (201)
ì (236), í (237)	⇒ Ì (204), Í (205)
ò (242), ó (243)	⇒ Ò (210), Ó (211)
ù (249), ú (250)	⇒ Ù (217), Ú (218)
ñ (241)	⇒ Ñ (209)
ý (253)	⇒ Ý (221)
ü (252)	⇒ Ü (220)

Code:

Exact Method Body:

 if ((c >= 'a') && (c <= 'z'))
     return (char) (c + 'A' - 'a');

 else if (
         (c == 224) || (c == 225) || (c == 232) || (c == 233)
     ||  (c == 236) || (c == 237) || (c == 242) || (c == 243)
     ||  (c == 249) || (c == 250) || (c == 241) || (c == 253)
     ||  (c == 252)
 )
     return (char) (c - 32);

 return c;

toUpperCaseSpanish

```
public static java.lang.String toUpperCaseSpanish(java.lang.String s)
```
This cycles through an input-String parameter, converts any/all letters that are lower-case (including ones with accent marks, tildes, and umlauts), and returns a String in which all characters are upper-case, but have their punctuation preserved.
Returns:

a new String in which toUpperCaseSpanish(char) has been invoked on each character.

See Also:

toUpperCaseSpanish(char)

Code:
Exact Method Body:

 StringBuilder ret = new StringBuilder();

 for (int i=0; i < s.length(); i++) ret.append(toUpperCaseSpanish(s.charAt(i)));

 return ret.toString();

isLanguageChar

public static boolean isLanguageChar(char c)

Checks if this character could be a Spanish Language Character

Parameters:

c - Any ASCII or Unicode Character

Returns:

TRUE: If and only if 'c' belongs to one of the following character sets:

a ... z
A ... Z
Á (193), É (201), Í (205), Ó (211), Ú (218), Ý (221), Ü (220), Ñ (209)
á (225), é (233), í (237), ó (243), ú (250), ý (253), ü (252), ñ (241)

and FALSE otherwise...

Code:

Exact Method Body:

 if ((c >= 'a') && (c <= 'z')) return true;
 if ((c >= 'A') && (c <= 'Z')) return true;

 // Á 193, É 201, Í 205, Ó 211, Ú 218, Ý 221, Ü 220, Ñ 209
 if (    (c == 193) || (c == 201) || (c == 205) || (c == 211) || (c == 218) || (c == 221)
     ||  (c == 220) || (c == 209))
     return true;

 // á 225, é 233, í 237, ó 243, ú 250, ý 253, ü 252, ñ 241
 if (    (c == 225) || (c == 233) || (c == 237) || (c == 243) || (c == 250) || (c == 253)
      || (c == 252) || (c == 241))
     return true;

 return false;

onlyLanguageChars

```
public static boolean onlyLanguageChars(java.lang.String s)
```
Checks that a String contains only Spanish-Language Characters. Utilizes isLanguageChar(char).
Parameters:

s - Any String consisting of ASCII & Unicode Characters

Returns:

TRUE only if isLanguageChar(s.charAt(i)) returns TRUE for every index i, and FALSE otherwise.

See Also:

isLanguageChar(char)

Code:
Exact Method Body:

 for (int i=0; i < s.length(); i++)
     if (! isLanguageChar(s.charAt(i))) return false;

 return true;
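
A minimal sketch of how the two checks behave on sample input. The class name `LanguageCharSketch` and the lookup string are hypothetical; the accepted characters are copied from the list documented for isLanguageChar(char), which notably excludes spaces, digits, punctuation, and grave-accented vowels.

```java
// Hypothetical stand-alone sketch of the documented character test.
class LanguageCharSketch {
    // Accented letters accepted by the documented isLanguageChar: acute
    // vowels, Y-acute, U-diaeresis and N-tilde, in upper and lower case.
    static final String ACCEPTED_ACCENTED =
        "\u00C1\u00C9\u00CD\u00D3\u00DA\u00DD\u00DC\u00D1"
      + "\u00E1\u00E9\u00ED\u00F3\u00FA\u00FD\u00FC\u00F1";

    static boolean isLanguageChar(char c) {
        if (c >= 'a' && c <= 'z') return true;
        if (c >= 'A' && c <= 'Z') return true;
        return ACCEPTED_ACCENTED.indexOf(c) >= 0;
    }

    static boolean onlyLanguageChars(String s) {
        for (int i = 0; i < s.length(); i++)
            if (!isLanguageChar(s.charAt(i))) return false;
        return true;
    }

    public static void main(String[] args) {
        System.out.println(onlyLanguageChars("ni\u00F1o"));         // true
        System.out.println(onlyLanguageChars("buenos d\u00EDas"));  // false: contains a space
        System.out.println(onlyLanguageChars(""));                  // true: nothing to reject
    }
}
```

Because a space is not a language character, the check is meant for single words rather than whole sentences.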

isSpanishVerbInfinitive

public static boolean isSpanishVerbInfinitive(java.lang.String s)

This is a function which identifies Spanish Language Infinitive Form Verbs.

Parameters:

s - Any String consisting of ASCII & Unicode Characters

Returns:

TRUE if and only if:
input-parameter 's' ends with: ar, er, ir, arse, erse, irse, ír, írse
's' passes the onlyLanguageChars(String) boolean test
FALSE otherwise

See Also:

onlyLanguageChars(String)

Code:

Exact Method Body:

 s = toLowerCaseSpanish(s);

 if (onlyLanguageChars(s))
     if (    s.endsWith("ar")    || s.endsWith("er")     || s.endsWith("ir")
         ||  s.endsWith("arse")  || s.endsWith("erse")   || s.endsWith("irse")
         ||  s.endsWith("ír")    || s.endsWith("írse"))
         return true;

 return false;
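
The test above is a pure suffix heuristic, so any single word that ends in one of the listed endings passes, including nouns such as "mujer". A minimal sketch under stated assumptions: the class name `InfinitiveSketch` is hypothetical, a regular expression stands in for onlyLanguageChars(String), and `String.toLowerCase(Locale.ROOT)` stands in for toLowerCaseSpanish(String).

```java
import java.util.Locale;

// Hypothetical stand-alone sketch of the documented suffix test.
class InfinitiveSketch {
    static final String[] ENDINGS =
        {"ar", "er", "ir", "arse", "erse", "irse", "\u00EDr", "\u00EDrse"};

    // Stand-in for onlyLanguageChars: ASCII letters plus the accented
    // letters documented for isLanguageChar.
    static final String LETTERS_ONLY =
        "[a-zA-Z\u00C1\u00C9\u00CD\u00D3\u00DA\u00DD\u00DC\u00D1"
      + "\u00E1\u00E9\u00ED\u00F3\u00FA\u00FD\u00FC\u00F1]*";

    static boolean isSpanishVerbInfinitive(String s) {
        if (!s.matches(LETTERS_ONLY)) return false;
        String lc = s.toLowerCase(Locale.ROOT);
        for (String ending : ENDINGS)
            if (lc.endsWith(ending)) return true;
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isSpanishVerbInfinitive("hablar"));      // true
        System.out.println(isSpanishVerbInfinitive("Levantarse"));  // true
        System.out.println(isSpanishVerbInfinitive("re\u00EDr"));   // true
        System.out.println(isSpanishVerbInfinitive("casa"));        // false
        System.out.println(isSpanishVerbInfinitive("mujer"));       // true, although it is a noun
        System.out.println(isSpanishVerbInfinitive("ir a casa"));   // false: contains spaces
    }
}
```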

convertHTML_TO_UTF8

public static java.lang.String convertHTML_TO_UTF8(java.lang.String s)

This function is somewhat redundant, as a complete HTML-Character Escape-Sequence class is included in the Torello.HTML package. There is a link provided to these methods at the end of this comment. This method was written much earlier, and functions well, but it can only convert HTML-Escape-Sequences that are used in Spanish - rather than all HTML-Character Escape-Sequences. Here is the complete list:

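&aacute;	⇒ á
&eacute;	⇒ é
&iacute;	⇒ í
&oacute;	⇒ ó
&uacute;	⇒ ú
&Aacute;	⇒ Á
&Eacute;	⇒ É
&Iacute;	⇒ Í
&Oacute;	⇒ Ó
&Uacute;	⇒ Ú
&ntilde;	⇒ ñ
&laquo;	⇒ «
&raquo;	⇒ »
&mdash;	⇒ -
&uuml;	⇒ ü
&iuml;	⇒ ï
&iexcl;	⇒ ¡
&iquest;	⇒ ¿
&quot;	⇒ "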

Parameters:

s - Any ASCII/Unicode String that may contain Spanish-Language HTML-Escaped characters.

Returns:

A string where all HTML escape-sequences have been converted to their actual character equivalent.

Code:

Exact Method Body:

 return StrReplace.r(s, ESC_STRS, REPL_CHARS);
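
The page does not show the contents of ESC_STRS and REPL_CHARS, nor how StrReplace.r is implemented. The following is only a sketch of the parallel-array idea: the class name `EscapeSketch`, the array contents (four standard HTML named entities), the `char[]` type, and the use of `String.replace` are all assumptions made for illustration.

```java
// Hypothetical sketch of a parallel-array escape replacement. The real
// ESC_STRS / REPL_CHARS arrays and StrReplace.r are not shown on this page.
class EscapeSketch {
    // Assumed sample entries: a-acute, n-tilde, inverted question mark, dash.
    static final String[] ESC_STRS   = {"&aacute;", "&ntilde;", "&iquest;", "&mdash;"};
    static final char[]   REPL_CHARS = {'\u00E1',   '\u00F1',   '\u00BF',   '-'};

    static String convert(String s) {
        // Entry i of ESC_STRS is replaced by entry i of REPL_CHARS.
        for (int i = 0; i < ESC_STRS.length; i++)
            s = s.replace(ESC_STRS[i], String.valueOf(REPL_CHARS[i]));
        return s;
    }

    public static void main(String[] args) {
        String html = "&iquest;Ma&ntilde;ana&mdash;s&aacute;bado?";
        System.out.println(convert(html).length());  // 15: each escape became one character
    }
}
```

Escape sequences that are not in the table are left untouched, which matches the documented limitation to Spanish-language sequences.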

setRemoveWordsArr

```
public static void setRemoveWordsArr(java.lang.String[] wordList)
```
This just stores a list of "words", and they are removed from certain texts/articles. This program currently uses it to remove certain extremely commonly used words, so they are not repeatedly searched for in the dictionary. It is kind of a hack.
Parameters:

wordList - An array of Strings. It is expected to be a list of words that may be removed from Spanish Texts, but it can be any list of words. It is checked to see if 100% of the characters in each word are alphabetic, and throws an IllegalArgumentException if they are not.

Throws:

java.lang.IllegalArgumentException - if the wordList parameter contains strings with invalid non-word characters.

Code:
Exact Method Body:

 removeList = new Vector<String>();

 for (int i=0; i < wordList.length; i++)
 {
     String word = wordList[i];

     for (int j=0; j < word.length(); j++)
         if (! isLanguageChar(word.charAt(j)))
             throw new IllegalArgumentException(
                 "Contains word:" + word + " which has invalid, non-word, language-characters");

     removeList.addElement(word);
 }

removeWords

public static java.lang.String removeWords(java.lang.String s)

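This function references the words in the "removeList" and removes every occurrence of each word that is present in the "removeList" Vector<String>. Only whole-word matches are removed: a match whose neighboring character is a language character is skipped.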

Parameters:

s - A String of Spanish Words.

Returns:

The same string with each instance of each word that is listed in the "removeList" Vector removed from the String

See Also:

setRemoveWordsArr(String[])

Code:

Exact Method Body:

 // boolean printIt = false;
 // int tpos = s.indexOf(" a ");
 // if (tpos != -1) if (s.indexOf(" a ", tpos + 3) != -1) printIt = true;
 // if (printIt) System.out.println(s + ":");
        
 Enumeration<String> e = removeList.elements();
 // System.out.println("CLEANING: [" + s + "]");

 while (e.hasMoreElements())
 {
     String lc = toLowerCaseSpanish(s);

     // System.out.print(" <" + lc + ">");
     String word = e.nextElement();

     // System.out.print(" {" + word + "}");
    
     int pos = 0;
     while ((pos = lc.indexOf(word, pos)) != -1)
     {
         int     startPos    = pos;
         int     endPos      = pos + word.length();
         boolean leftEnd     = (startPos == 0);
         boolean rightEnd    = (endPos == lc.length());
         char    leftChar    = leftEnd ? 0 : lc.charAt(startPos - 1);
         char    rightChar   = rightEnd ? 0 : lc.charAt(endPos);

         // if (printIt) System.out.print("(" + leftChar + "," + rightChar + "," + leftEnd +
         // "," + rightEnd + "," + startPos + "," + endPos + ") ");
    
         if (isLanguageChar(leftChar))   { pos = endPos; continue; }
         if (isLanguageChar(rightChar))  { pos = endPos; continue; }

         // System.out.print("(" + startPos + "," + endPos + ")" );
         boolean leftSpace = (leftChar == ' ');
         boolean rightSpace = (rightChar == ' ');

         if (leftSpace && rightSpace)    startPos--;
         else if (leftSpace && rightEnd) startPos--;
         else if (leftEnd && rightSpace) endPos++;
                
         s = (leftEnd ? "" : s.substring(0, startPos)) +
             (rightEnd ? "" : s.substring(endPos));

         // if (printIt) System.out.print("[" + s + "] ");
         lc = toLowerCaseSpanish(s);
     }
 }

 // if (printIt) System.out.println("\n");
 return s;
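
The core of the loop above is its whole-word test: a match is skipped when the character directly before or after it is a language character. A minimal sketch of just that test follows; the class name `WholeWordSketch` and the helper `isWholeWordAt` are hypothetical, and isLanguageChar is re-created from its documented character list.

```java
// Hypothetical sketch of the boundary test used inside removeWords.
class WholeWordSketch {
    static final String ACCEPTED_ACCENTED =
        "\u00C1\u00C9\u00CD\u00D3\u00DA\u00DD\u00DC\u00D1"
      + "\u00E1\u00E9\u00ED\u00F3\u00FA\u00FD\u00FC\u00F1";

    static boolean isLanguageChar(char c) {
        if (c >= 'a' && c <= 'z') return true;
        if (c >= 'A' && c <= 'Z') return true;
        return ACCEPTED_ACCENTED.indexOf(c) >= 0;
    }

    // True when the match of length len at pos has no language character
    // directly before or after it, i.e. it is not part of a longer word.
    static boolean isWholeWordAt(String text, int pos, int len) {
        int end = pos + len;
        char left  = (pos == 0) ? 0 : text.charAt(pos - 1);
        char right = (end == text.length()) ? 0 : text.charAt(end);
        return !isLanguageChar(left) && !isLanguageChar(right);
    }

    public static void main(String[] args) {
        String text = "voy a casa";
        // Visit every occurrence of "a" and report whether it stands alone.
        for (int pos = text.indexOf("a"); pos != -1; pos = text.indexOf("a", pos + 1))
            System.out.println(pos + ": " + isWholeWordAt(text, pos, 1));
        // 4: true   (the stand-alone word "a")
        // 7: false  (inside "casa")
        // 9: false  (inside "casa")
    }
}
```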
