Class ES


  • public class ES
    extends java.lang.Object
    Simple String utilities to help parse Spanish (Español) Strings.

    This class provides some simple helper routines for working with Spanish language special characters. It deals particularly with accented vowels.


    • Field Summary

      Fields 
      Modifier and Type Field Description
      static int GRAVE
      GRAVE and ACUTE share the "first bit" of this mask; if that bit is '0', then the mask is ACUTE
      static int UPPERCASE
      UPPER and LOWER CASE share the "second bit" of this mask; if that bit is '0', then the mask is LOWER-CASE
    • Method Summary

      Modifier and Type Method Description
      static String convertHTML_TO_UTF8​(String s)
      This function is somewhat redundant, as a complete HTML-Character Escape-Sequence class is included in the Torello.HTML package.
      static char getAccentedVowel​(char vowel, int flags)
      This is intended to produce an accented vowel 'on request' from the method invocation.
      static boolean isLanguageChar​(char c)
      Checks if this character could be a Spanish Language Character
      static boolean isSpanishVerbInfinitive​(String s)
      This is a function which identifies Spanish Language Infinitive Form Verbs.
      static boolean onlyLanguageChars​(String s)
      Checks whether a String consists solely of Spanish-Language Characters.
      static String removeWords​(String s)
      This function references the words in the "removeList" and removes every occurrence of each word that is present in the "removeList" Vector<String>
      static void setRemoveWordsArr​(String[] wordList)
      This just stores a list of "words", and they are removed from certain texts/articles.
      static char toLowerCaseSpanish​(char c)
      Produces a lower-case Spanish Character - if and only if the input-parameter is an upper-case Spanish Character.
      static String toLowerCaseSpanish​(String s)
      This cycles through an input-String parameter and converts all letters that are upper-case, including ones with accent marks, tildes, and umlauts, and returns a String in which all letters are lower-case, with punctuation preserved.
      static char toNonAccented​(char c, boolean preserveCase)
      This converts a Spanish-Accented character into its lower-case, non-accented equivalent.
      static String toNonAccented​(String s, boolean preserveCase)
      Removes Spanish-Accent Characters from all characters in a string.
      static char toUpperCaseSpanish​(char c)
      Produces an upper-case Spanish Character - if and only if the input-parameter is a lower-case Spanish Character.
      static String toUpperCaseSpanish​(String s)
      This cycles through an input-String parameter and converts all letters that are lower-case, including ones with accent marks, tildes, and umlauts, and returns a String in which all letters are upper-case, with punctuation preserved.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • GRAVE

        public static final int GRAVE
        GRAVE and ACUTE share the "first bit" of this mask; if that bit is '0', then the mask is ACUTE
        See Also:
        Constant Field Values
        Code:
        Exact Field Declaration Expression:
         public static final int GRAVE = 0b0001;
        
      • UPPERCASE

        public static final int UPPERCASE
        UPPER and LOWER CASE share the "second bit" of this mask; if that bit is '0', then the mask is LOWER-CASE
        See Also:
        Constant Field Values
        Code:
        Exact Field Declaration Expression:
         public static final int UPPERCASE	= 0b0010;
        
    • Method Detail

      • getAccentedVowel

        public static char getAccentedVowel​(char vowel,
                                            int flags)
        This is intended to produce an accented vowel 'on request' from the method invocation. The complete list of characters that may be returned by this function is given below.


        Upper, Grave    Upper, Acute    Lower, Grave    Lower, Acute
        À (192)         Á (193)         à (224)         á (225)
        È (200)         É (201)         è (232)         é (233)
        Ì (204)         Í (205)         ì (236)         í (237)
        Ò (210)         Ó (211)         ò (242)         ó (243)
        Ù (217)         Ú (218)         ù (249)         ú (250)
        Parameters:
        vowel - Any vowel: [A, E, I, O, U] or [a, e, i, o, u]

        If 'vowel' is not one of these 10 characters, this method returns (char) 0.
        flags - The following values can be OR'd together (masked): ES.GRAVE or ES.UPPERCASE

        In total, there are 4 possible versions: Upper-Case/Lower-Case output, and Acute/Grave output.

        • If ES.GRAVE is not masked (binary-bit 0), then an "acute" accented vowel is returned (acute is "the default").
        • If ES.UPPERCASE is not masked (binary-bit 1), then a lower-case vowel is returned (lower-case is "the default").
        Returns:
        With correct input: one of the twenty accented vowels listed above; otherwise ASCII 0 is returned.
        Code:
        Exact Method Body:
         int i = 0;
        
         if		((vowel == 'a') || (vowel == 'A')) i = 192;
         else if	((vowel == 'e') || (vowel == 'E')) i = 200;
         else if ((vowel == 'i') || (vowel == 'I')) i = 204;
         else if ((vowel == 'o') || (vowel == 'O')) i = 210;
         else if ((vowel == 'u') || (vowel == 'U')) i = 217;
         else return (char) 0;
        
         // À (192)È (200)Ì (204)Ò (210)Ù (217)
         if (    ((flags & UPPERCASE) > 0)
             &&  ((flags & GRAVE) > 0)
         )
             return (char) (i + 0);
        
         // Á (193)É (201)Í (205)Ó (211)Ú (218)
         else if	((flags & UPPERCASE) > 0) return (char) (i + 1);
        
         // à (224)è (232)ì (236)ò (242)ù (249)
         else if ((flags & GRAVE) > 0) return (char) (i + 32);
        
         // á (225)é (233)í (237)ó (243)ú (250)
         else return (char) (i + 33);
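
        The flag handling above can be illustrated with a small stand-alone sketch. It does not call this class: AccentFlagsSketch and its accented method are hypothetical names, and the lookup simply mirrors the code-point table above (acute = grave + 1, lower-case = upper-case + 32).

```java
// Hypothetical stand-alone sketch; it mirrors the table above, not the library source.
final class AccentFlagsSketch {
    static final int GRAVE = 0b0001;      // same bit values as the fields documented above
    static final int UPPERCASE = 0b0010;

    // Code points of the upper-case grave forms of a, e, i, o, u.
    private static final int[] BASE = {192, 200, 204, 210, 217};

    static char accented(char vowel, int flags) {
        int pos = "aeiouAEIOU".indexOf(vowel);
        if (pos < 0) return (char) 0;                        // not one of the 10 vowels
        int offset = ((flags & GRAVE) != 0 ? 0 : 1)          // acute is grave + 1
                   + ((flags & UPPERCASE) != 0 ? 0 : 32);    // lower-case is upper-case + 32
        return (char) (BASE[pos % 5] + offset);
    }

    public static void main(String[] args) {
        System.out.println((int) accented('e', 0));                  // lower-case acute: 233
        System.out.println((int) accented('e', GRAVE | UPPERCASE));  // upper-case grave: 200
    }
}
```

        Passing 0 selects both defaults (lower-case, acute); the two flags are independent bits, so they can be combined with the | operator.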
        
      • toNonAccented

        public static char toNonAccented​(char c,
                                         boolean preserveCase)
        This converts a Spanish-Accented character into its lower-case, non-accented equivalent. Unaccented upper-case characters are also down-cased. If specifically requested, case can be preserved.

        A (65) ... Z (90) ⇒ a .. z
        À (192), Á (193), à (224), á (225) ⇒ A or a
        È (200), É (201), è (232), é (233) ⇒ E or e
        Ì (204), Í (205), ì (236), í (237) ⇒ I or i
        Ò (210), Ó (211), ò (242), ó (243) ⇒ O or o
        Ù (217), Ú (218), ù (249), ú (250) ⇒ U or u
        Ñ (209), ñ (241) ⇒ N or n
        Ü (220), ü (252) ⇒ U or u
        Ý (221), ý (253) ⇒ Y or y
        Parameters:
        c - Any ASCII/Unicode character
        preserveCase - If this is TRUE, then capital letters, accented or not, remain capitalized. If this is FALSE, then all letters are converted to lowercase.
        Returns:
        If this character contained an accent, it will be removed. It will also be in lower-case form, unless preserveCase is TRUE.
        Code:
        Exact Method Body:
         if ((c == 224) || (c == 225))   return 'a';
         if ((c == 232) || (c == 233))   return 'e';
         if ((c == 236) || (c == 237))   return 'i';
         if ((c == 242) || (c == 243))   return 'o';
         if ((c == 249) || (c == 250))   return 'u';
         if (c == 241)                   return 'n';
         if (c == 252)                   return 'u';
         if (c == 253)                   return 'y';
        
         if ((c == 192) || (c == 193))   return (preserveCase ? 'A' : 'a');
         if ((c == 200) || (c == 201))   return (preserveCase ? 'E' : 'e');
         if ((c == 204) || (c == 205))   return (preserveCase ? 'I' : 'i');
         if ((c == 210) || (c == 211))   return (preserveCase ? 'O' : 'o');
         if ((c == 217) || (c == 218))   return (preserveCase ? 'U' : 'u');
         if (c == 209)                   return (preserveCase ? 'N' : 'n');
         if (c == 220)                   return (preserveCase ? 'U' : 'u');
         if (c == 221)                   return (preserveCase ? 'Y' : 'y');
        
         if ((c >= 'A') && (c <= 'Z'))   return (char) (preserveCase ? c : (c -'A' + 'a'));
        
         return c;
        
      • toNonAccented

        public static java.lang.String toNonAccented​(java.lang.String s,
                                                     boolean preserveCase)
        Removes Spanish-Accent Characters from all characters in a string.
        Returns:
        a new String, one where toNonAccented(s.charAt(i), preserveCase) has been called for each character in the String. This is just a small for-loop over a String.
        See Also:
        toNonAccented(char, boolean)
        Code:
        Exact Method Body:
         StringBuilder   sb  = new StringBuilder();
         int             len = s.length();
        
         for (int i=0; i < len; i++) sb.append(toNonAccented(s.charAt(i), preserveCase));
        
         return sb.toString();
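
        For comparison, accent removal is often done with Unicode decomposition instead of a per-character table. The sketch below shows that alternative technique using the JDK's java.text.Normalizer; it is not this class's implementation, and NormalizerSketch and stripAccents are hypothetical names. It preserves case (like preserveCase = TRUE) and strips combining marks from any letter, so it is broader than the table above (for example, ç becomes c).

```java
import java.text.Normalizer;

// Alternative technique, not this class's implementation: Unicode decomposition.
final class NormalizerSketch {
    static String stripAccents(String s) {
        // NFD splits each accented letter into a base letter plus combining marks
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        // \p{M} matches the combining marks, which are then dropped
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        // prints "Manana sera otro dia"
        System.out.println(stripAccents("Ma\u00f1ana ser\u00e1 otro d\u00eda"));
    }
}
```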
        
      • toLowerCaseSpanish

        public static char toLowerCaseSpanish​(char c)
        Produces a lower-case Spanish Character - if and only if the input-parameter is an upper-case Spanish Character. This is almost identical to the usual Character.toLowerCase(char) function, but it explicitly handles Spanish vowels and consonants with:

        • accent marks: À, Á, à, and á ... etc.
        • umlauts: Ü and ü
        • tildes: Ñ and ñ

        NOTE: The 'acute' and 'grave' accent marks are not as prevalent as they were in the time of "Don Quijote de la Mancha" - however, they are included here, just in case. Mostly the 'acute' accent mark (from top-right-corner to the lower-left-corner) is used in newspapers around here (Dallas, Texas).
        Parameters:
        c - Any ASCII or Unicode char
        Returns:
        Uppercase letters 'A' .. 'Z' are converted to 'a' .. 'z'
        AND:

        À (192), Á (193) ⇒ à (224), á (225)
        È (200), É (201) ⇒ è (232), é (233)
        Ì (204), Í (205) ⇒ ì (236), í (237)
        Ò (210), Ó (211) ⇒ ò (242), ó (243)
        Ù (217), Ú (218) ⇒ ù (249), ú (250)
        Ñ (209) ⇒ ñ (241)
        Ý (221) ⇒ ý (253)
        Ü (220) ⇒ ü (252)
        See Also:
        toUpperCaseSpanish(char), toLowerCaseSpanish(String)
        Code:
        Exact Method Body:
         if ((c >= 'A') && (c <= 'Z')) return (char) (c + 'a' - 'A');
        
         else if (
                 (c == 192) || (c == 193) || (c == 200) || (c == 201)
             ||  (c == 204) || (c == 205) || (c == 210) || (c == 211)
             ||  (c == 217) || (c == 218) || (c == 209) || (c == 220)
             ||  (c == 221)
         )
             return (char) (c + 32);
        
         return c;
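
        The single "+ 32" in the body above works because, in the Latin-1 range, the accented upper-case letters (192 to 221 here) sit exactly 32 code points below their lower-case forms, the same distance that separates 'A' from 'a'. The following sketch, with the hypothetical name CaseOffsetSketch, checks that offset against the JDK's own Character.toLowerCase(char) for the 13 code points handled above.

```java
// Hypothetical check that "+ 32" agrees with the JDK for the code points listed above.
final class CaseOffsetSketch {
    // The 13 upper-case code points tested in the method body above.
    static final int[] UPPER = {192, 193, 200, 201, 204, 205, 210, 211, 217, 218, 209, 220, 221};

    // True if adding 32 gives the same result as Character.toLowerCase for every entry.
    static boolean offsetMatchesJdk() {
        for (int cp : UPPER)
            if (Character.toLowerCase((char) cp) != (char) (cp + 32)) return false;
        return true;
    }

    public static void main(String[] args) {
        System.out.println(offsetMatchesJdk());   // prints "true"
    }
}
```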
        
      • toLowerCaseSpanish

        public static java.lang.String toLowerCaseSpanish​(java.lang.String s)
        This cycles through an input-String parameter and converts all letters that are upper-case, including ones with accent marks, tildes, and umlauts, and returns a String in which all letters are lower-case, with punctuation preserved.
        Returns:
        a new string in which toLowerCaseSpanish(char) has been invoked on each character.
        See Also:
        toLowerCaseSpanish(char)
        Code:
        Exact Method Body:
         StringBuilder ret = new StringBuilder();
         for (int i=0; i < s.length(); i++) ret.append(toLowerCaseSpanish(s.charAt(i)));
         return ret.toString();
        
      • toUpperCaseSpanish

        public static char toUpperCaseSpanish​(char c)
        Produces an upper-case Spanish Character - if and only if the input-parameter is a lower-case Spanish Character. See toLowerCaseSpanish(char) for more notes!
        Parameters:
        c - Any ASCII or Unicode char
        Returns:
        Lowercase letters 'a' .. 'z' are converted to 'A' .. 'Z'

        AND:

        à (224), á (225) ⇒ À (192), Á (193)
        è (232), é (233) ⇒ È (200), É (201)
        ì (236), í (237) ⇒ Ì (204), Í (205)
        ò (242), ó (243) ⇒ Ò (210), Ó (211)
        ù (249), ú (250) ⇒ Ù (217), Ú (218)
        ñ (241) ⇒ Ñ (209)
        ý (253) ⇒ Ý (221)
        ü (252) ⇒ Ü (220)
        See Also:
        toLowerCaseSpanish(char), toUpperCaseSpanish(String)
        Code:
        Exact Method Body:
         if ((c >= 'a') && (c <= 'z'))
             return (char) (c + 'A' - 'a');
        
         else if (	(c == 224) || (c == 225) || (c == 232) || (c == 233)
                 ||  (c == 236) || (c == 237) || (c == 242) || (c == 243)
                 ||  (c == 249) || (c == 250) || (c == 241) || (c == 253)
                 ||  (c == 252)
             )
             return (char) (c - 32);
        
         return c;
        
      • toUpperCaseSpanish

        public static java.lang.String toUpperCaseSpanish​(java.lang.String s)
        This cycles through an input-String parameter and converts all letters that are lower-case, including ones with accent marks, tildes, and umlauts, and returns a String in which all letters are upper-case, with punctuation preserved.
        Returns:
        a new string in which toUpperCaseSpanish(char) has been invoked on each character.
        See Also:
        toUpperCaseSpanish(char)
        Code:
        Exact Method Body:
         StringBuilder ret = new StringBuilder();
         for (int i=0; i < s.length(); i++) ret.append(toUpperCaseSpanish(s.charAt(i)));
         return ret.toString();
        
      • isLanguageChar

        public static boolean isLanguageChar​(char c)
        Checks if this character could be a Spanish Language Character
        Parameters:
        c - Any ASCII or Unicode character
        Returns:
        TRUE: If and only if 'c' is one of the following char-sets:

        • a ... z
        • A ... Z
        • Á (193), É (201), Í (205), Ó (211), Ú (218), Ý (221), Ü (220), Ñ (209)
        • á (225), é (233), í (237), ó (243), ú (250), ý (253), ü (252), ñ (241)

        and FALSE otherwise...
        Code:
        Exact Method Body:
         if ((c >= 'a') && (c <= 'z')) return true;
         if ((c >= 'A') && (c <= 'Z')) return true;
        
         // Á 193, É 201, Í 205, Ó 211, Ú 218, Ý 221, Ü 220, Ñ 209
         if (    (c == 193) || (c == 201) || (c == 205) || (c == 211) || (c == 218) || (c == 221)
             ||  (c == 220) || (c == 209))
             return true;
        
         // á 225, é 233, í 237, ó 243, ú 250, ý 253, ü 252, ñ 241
         if (    (c == 225) || (c == 233) || (c == 237) || (c == 243) || (c == 250) || (c == 253)
              || (c == 252) || (c == 241))
             return true;
        
         return false;
        
      • onlyLanguageChars

        public static boolean onlyLanguageChars​(java.lang.String s)
        Checks whether a String consists solely of Spanish-Language Characters. Utilizes isLanguageChar(char)
        Parameters:
        s - Any String consisting of ASCII & Unicode Characters
        Returns:
        TRUE only if isLanguageChar(s.charAt(i)) returns TRUE for every index i, and FALSE otherwise.
        See Also:
        isLanguageChar(char)
        Code:
        Exact Method Body:
         for (int i=0; i < s.length(); i++) if (! isLanguageChar(s.charAt(i))) return false;
         return true;
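
        The same test can be written as a single regular expression over the character set listed for isLanguageChar(char). This is a sketch only; LanguageCharsSketch is a hypothetical name and the class is not part of this library. Two consequences of the loop above are visible in it: the empty String passes (there is no character to reject), and grave-accented vowels such as à (224) fail, because isLanguageChar(char) lists only the acute forms.

```java
// Hypothetical regex equivalent of the loop above; not part of this library.
final class LanguageCharsSketch {
    // a-z, A-Z, then the 16 accented letters accepted by isLanguageChar (193 ... 253)
    private static final String LETTERS =
        "a-zA-Z\u00C1\u00C9\u00CD\u00D3\u00DA\u00DD\u00DC\u00D1"
        + "\u00E1\u00E9\u00ED\u00F3\u00FA\u00FD\u00FC\u00F1";

    static boolean onlyLanguageChars(String s) {
        // matches() must consume the whole String, so one foreign character fails it
        return s.matches("[" + LETTERS + "]*");
    }

    public static void main(String[] args) {
        System.out.println(onlyLanguageChars("ni\u00f1o"));      // true
        System.out.println(onlyLanguageChars("hola mundo"));     // false: contains a space
    }
}
```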
        
      • isSpanishVerbInfinitive

        public static boolean isSpanishVerbInfinitive​(java.lang.String s)
        This is a function which identifies Spanish Language Infinitive Form Verbs.
        Parameters:
        s - Any String consisting of ASCII & Unicode Characters
        Returns:
        TRUE if and only if both of the following hold:
        • input-parameter 's' ends with one of: ar, er, ir, arse, erse, irse, ír, írse
        • 's' passes the onlyLanguageChars(String) boolean test
        FALSE otherwise
        See Also:
        onlyLanguageChars(String)
        Code:
        Exact Method Body:
         s = toLowerCaseSpanish(s);
        
         if (onlyLanguageChars(s))
             if (    s.endsWith("ar")	|| s.endsWith("er")		|| s.endsWith("ir")
                 ||  s.endsWith("arse")	|| s.endsWith("erse")	|| s.endsWith("irse")
                 ||  s.endsWith("ír")	|| s.endsWith("írse"))
                 return true;
        
         return false;
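
        Because the test is purely orthographic, any all-letter word with one of these endings passes, including nouns such as "mujer". The sketch below restates the two conditions as regular expressions to make that visible; InfinitiveSketch and looksLikeInfinitive are hypothetical names, and the class is an illustration rather than this library's code.

```java
import java.util.Locale;

// Hypothetical restatement of the two conditions above; not part of this library.
final class InfinitiveSketch {
    // The character set accepted by isLanguageChar: a-z, A-Z and 16 accented letters
    private static final String LETTERS =
        "a-zA-Z\u00C1\u00C9\u00CD\u00D3\u00DA\u00DD\u00DC\u00D1"
        + "\u00E1\u00E9\u00ED\u00F3\u00FA\u00FD\u00FC\u00F1";

    static boolean looksLikeInfinitive(String s) {
        if (!s.matches("[" + LETTERS + "]*")) return false;   // letters only
        // ar, er, ir or accented ir, optionally followed by the reflexive "se"
        return s.toLowerCase(Locale.ROOT).matches(".*(ar|er|ir|\u00EDr)(se)?");
    }

    public static void main(String[] args) {
        System.out.println(looksLikeInfinitive("hablar"));   // true
        System.out.println(looksLikeInfinitive("mujer"));    // true, although it is a noun
        System.out.println(looksLikeInfinitive("casa"));     // false
    }
}
```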
        
      • convertHTML_TO_UTF8

        public static java.lang.String convertHTML_TO_UTF8​(java.lang.String s)
        This function is somewhat redundant, as a complete HTML-Character Escape-Sequence class is included in the Torello.HTML package. There is a link provided to these methods at the end of this comment. This method was written much earlier, and functions well, but it can only convert HTML-Escape-Sequences that are used in Spanish - rather than all HTML-Character Escape-Sequences. Here is the complete list:

        &aacute; ⇒ á
        &eacute; ⇒ é
        &iacute; ⇒ í
        &oacute; ⇒ ó
        &uacute; ⇒ ú
        &Aacute; ⇒ Á
        &Eacute; ⇒ É
        &Iacute; ⇒ Í
        &Oacute; ⇒ Ó
        &Uacute; ⇒ Ú
        &ntilde; ⇒ ñ
        &laquo; ⇒ «
        &raquo; ⇒ »
        &mdash; ⇒ -
        &uuml; ⇒ ü
        &iuml; ⇒ ï
        &iexcl; ⇒ ¡
        &iquest; ⇒ ¿
        &quot; ⇒ "
        Parameters:
        s - Any ASCII/Unicode String, which may contain HTML-escaped Spanish-Language characters.
        Returns:
        A string in which each of the HTML escape-sequences listed above has been converted to its actual character equivalent.
        See Also:
        Escape.escHTMLToChar(String), Escape.htmlEsc(char), StrReplace.r(String, String[], char[])
        Code:
        Exact Method Body:
         return StrReplace.r(s, ESC_STRS, REPL_CHARS);
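
        The contents of ESC_STRS and REPL_CHARS are not shown on this page, and StrReplace.r belongs to the library. As a rough stand-alone illustration of the parallel-array idea, the sketch below uses plain String.replace over a small subset of the table above; EscapeSketch, its arrays, and unescape are hypothetical names and an assumption about the general approach, not the library's code.

```java
// Hypothetical sketch of parallel-array replacement; the array contents are a
// subset of the table above, not the library's actual ESC_STRS / REPL_CHARS.
final class EscapeSketch {
    private static final String[] ESCAPES = {"&aacute;", "&eacute;", "&ntilde;", "&iquest;", "&quot;"};
    private static final char[]   CHARS   = {'\u00E1',   '\u00E9',   '\u00F1',   '\u00BF',   '"'};

    static String unescape(String s) {
        // Entry i of ESCAPES is replaced by entry i of CHARS, one pass per entry
        for (int i = 0; i < ESCAPES.length; i++)
            s = s.replace(ESCAPES[i], String.valueOf(CHARS[i]));
        return s;
    }

    public static void main(String[] args) {
        System.out.println(unescape("Espa&ntilde;a").length());   // prints 6
    }
}
```

        Escape sequences outside the table, such as &amp;, pass through unchanged, which matches the limitation described above.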
        
      • setRemoveWordsArr

        public static void setRemoveWordsArr​(java.lang.String[] wordList)
        This just stores a list of "words", and they are removed from certain texts/articles. This program currently uses it to remove certain extremely commonly used words, so they are not repeatedly searched for in the dictionary. It is kind of a hack.
        Parameters:
        wordList - An array of Strings. It is expected to be a list of words that may be removed from Spanish Texts, but it can be any list of words. It is checked to see if 100% of the characters in each word are alphabetic, and throws an IllegalArgumentException if they are not.
        Throws:
        java.lang.IllegalArgumentException - if the wordList parameter contains strings with invalid non-word characters.
        Code:
        Exact Method Body:
         removeList = new Vector<String>();
                
         for (int i=0; i < wordList.length; i++)
         {
             String word = wordList[i];
        
             for (int j=0; j < word.length(); j++)
        
                 if (! isLanguageChar(word.charAt(j))) throw new IllegalArgumentException(
                     "Contains word:" + word + " which has invalid, non-word, language-characters");
        
             removeList.addElement(word);
         }
        
      • removeWords

        public static java.lang.String removeWords​(java.lang.String s)
        This function references the words in the "removeList" and removes every occurrence of each word that is present in the "removeList" Vector<String>
        Parameters:
        s - A String of Spanish Words.
        Returns:
        The same string with each instance of each word that is listed in the "removeList" Vector removed from the String
        See Also:
        setRemoveWordsArr(String[])
        Code:
        Exact Method Body:
         // boolean printIt = false;
         // int tpos = s.indexOf(" a ");
         // if (tpos != -1) if (s.indexOf(" a ", tpos + 3) != -1) printIt = true;
         // if (printIt) System.out.println(s + ":");
                
         Enumeration<String> e = removeList.elements();
         // System.out.println("CLEANING: [" + s + "]");
        
         while (e.hasMoreElements())
         {
             String lc = toLowerCaseSpanish(s);
        
             // System.out.print(" <" + lc + ">");
             String word = e.nextElement();
        
             // System.out.print(" {" + word + "}");
            
             int pos = 0;
             while ((pos = lc.indexOf(word, pos)) != -1)
             {
                 int     startPos    = pos;
                 int     endPos      = pos + word.length();
                 boolean leftEnd     = (startPos == 0);
                 boolean rightEnd    = (endPos == lc.length());
                 char    leftChar    = leftEnd ? 0 : lc.charAt(startPos - 1);
                 char    rightChar   = rightEnd ? 0 : lc.charAt(endPos);
        
                 // if (printIt) System.out.print("(" + leftChar + "," + rightChar + "," + leftEnd +
                 // "," + rightEnd + "," + startPos + "," + endPos + ") ");
            
                 if (isLanguageChar(leftChar))   { pos = endPos; continue; }
                 if (isLanguageChar(rightChar))  { pos = endPos; continue; }
        
                 // System.out.print("(" + startPos + "," + endPos + ")" );
                 boolean leftSpace = (leftChar == ' ');
                 boolean rightSpace = (rightChar == ' ');
        
                 if (leftSpace && rightSpace)    startPos--;
                 else if (leftSpace && rightEnd) startPos--;
                 else if (leftEnd && rightSpace) endPos++;
                        
                 s = (leftEnd ? "" : s.substring(0, startPos)) +
                     (rightEnd ? "" : s.substring(endPos));
        
                 // if (printIt) System.out.print("[" + s + "] ");
                 lc = toLowerCaseSpanish(s);
             }
         }
        
         // if (printIt) System.out.println("\n");
         return s;