Package Torello.HTML
Class Escape
- java.lang.Object
-
- Torello.HTML.Escape
-
public final class Escape extends java.lang.Object
Easy utilities for escaping and un-escaping HTML characters, including code-point based Emoji's.
There are dozens of "Escaped HTML" symbols in the HTML language. This class helps convert from an "escaped character" to the underlying/actual UTF-8 or ASCII 'char' (and vice-versa).
File Size: 22,447 Bytes Line Count: 528 '\n' Characters Found
Stateless Class: This class neither contains any program-state, nor can it be instantiated. The @StaticFunctional Annotation may also be called 'The Spaghetti Report'. Static-Functional classes are, essentially, C-Styled Files, without any constructors or non-static member fields. It is a concept very similar to the Java-Bean's @Stateless Annotation.
- 1 Constructor(s), 1 declared private, zero-argument constructor
- 11 Method(s), 11 declared static
- 6 Field(s), 6 declared static, 6 declared final
-
-
Method Summary
Basic Methods
Modifier and Type    Method
static boolean       hasHTMLEsc(char c)
static void          printHTMLEsc()

Escape Characters to HTML Escape-Strings
Modifier and Type    Method
static String        escChar(char c, boolean use16BitEscapeSequence)
static String        escCodePoint(int codePoint, boolean use16BitEscapeSequence)
static String        htmlEsc(char c)

Un-Escape HTML Escape-Strings to Characters
Modifier and Type    Method
static char          escHTMLToChar(String escHTML)
static String        replace(String s)
static String        replaceAll(String s)
static String        replaceAll_DEC(String str)
static String        replaceAll_HEX(String str)
static String        replaceAll_TEXT(String str)
-
-
-
Method Detail
-
printHTMLEsc
public static void printHTMLEsc()
Prints the HTML Escape Character lookup table to System.out. This is useful for debugging.
View Escape-Codes:
The JAR Data-File List included within the page attached (below) is a complete list of all text-String HTML Escape Sequences that are known to this class. This list does not include any Code Point, Hex or Decimal Number sequences.
All HTML Escape Sequences
- Code:
- Exact Method Body:
Enumeration<String> e = htmlEscChars.keys();

while (e.hasMoreElements())
{
    String tag = e.nextElement();
    System.out.println("&" + tag + "; ==> " + htmlEscChars.get(tag));
}
-
escHTMLToChar
public static char escHTMLToChar(java.lang.String escHTML)
Converts a single String from an HTML-escape sequence into the appropriate character.
&[escape-sequence]; ==> actual ASCII or UniCode character.
- Parameters:
escHTML - An HTML escape sequence.
- Returns:
- the ASCII or Unicode character represented by this escape sequence.
This method will return (char) 0, the NUL character rather than the digit '0', if the input does not represent a valid HTML Escape sequence, or if its numeric value is not below Character.MAX_VALUE.
- Code:
- Exact Method Body:
if (! escHTML.startsWith("&") || ! escHTML.endsWith(";")) return (char) 0;

String s = escHTML.substring(1, escHTML.length() - 1);

// Temporary Variable.
int i = 0;

// Since the EMOJI Escape Sequences use Code Point, they cannot, generally be
// converted into a single Character.  Skip them.
if (HEX_CODE.matcher(s).find())
{
    if ((i = Integer.parseInt(s.substring(2), 16)) < Character.MAX_VALUE)
        return (char) i;
    else
        return 0;
}

// Again, deal with Emoji's here...  Parse the integer, and make sure it is a
// character in the standard UNICODE range.
if (DEC_CODE.matcher(s).find())
{
    if ((i = Integer.parseInt(s.substring(1))) < Character.MAX_VALUE)
        return (char) i;
    else
        return 0;
}

// Now check if the provided Escape String is listed in the htmlEscChars Hashtable.
Character c = htmlEscChars.get(s);

// If the character was found in the table that lists all escape sequence characters,
// then return it.  Otherwise just return ASCII zero.
return (c != null) ? c.charValue() : 0;
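The three-way dispatch described above (hexadecimal, decimal, then named lookup) can be sketched with nothing but the standard library. This is a minimal illustration and not the library's implementation: the class name `SingleEscapeSketch`, the method `toChar`, and the four-entry `NAMED` table are all hypothetical stand-ins, and the sketch assumes Java 9 or later for `Map.of`.

```java
import java.util.Map;

class SingleEscapeSketch {
    // Hypothetical, deliberately tiny lookup table (an assumption for this sketch).
    private static final Map<String, Character> NAMED =
        Map.of("amp", '&', "lt", '<', "gt", '>', "quot", '"');

    /** Returns the char for one escape sequence, or (char) 0 if it cannot be converted. */
    static char toChar(String esc) {
        if (esc.length() < 3 || !esc.startsWith("&") || !esc.endsWith(";")) return (char) 0;

        // Strip the leading '&' and the trailing ';'
        String body = esc.substring(1, esc.length() - 1);

        try {
            int value;
            if (body.startsWith("#x") || body.startsWith("#X"))
                value = Integer.parseInt(body.substring(2), 16);   // base-16 form
            else if (body.startsWith("#"))
                value = Integer.parseInt(body.substring(1));       // base-10 form
            else {
                Character c = NAMED.get(body);                     // named form
                return (c != null) ? c : (char) 0;
            }

            // Values above Character.MAX_VALUE need two chars, so they are rejected here.
            return (value >= 0 && value <= Character.MAX_VALUE) ? (char) value : (char) 0;
        } catch (NumberFormatException e) {
            return (char) 0;
        }
    }

    public static void main(String[] args) {
        System.out.println(toChar("&#x41;"));            // prints A
        System.out.println(toChar("&amp;"));             // prints &
        System.out.println((int) toChar("&#x1F600;"));   // prints 0 (does not fit in one char)
    }
}
```

Returning `(char) 0` as a sentinel mirrors the convention documented for this method; a caller that needs to distinguish a genuine NUL from a failure would need a different return type.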
-
replaceAll_HEX
public static java.lang.String replaceAll_HEX(java.lang.String str)
Returns a String in which all Hexadecimal Escape Sequences have been replaced with the ASCII/UniCode characters they represent.
Hexadecimal HTML Escape-Sequence Examples:
Substring from Input:    Web-Browser Converts To:
&#xAA;                   'ª' within a browser
&#x67;                   'g' within a browser
&#x84;                   '„' within a browser
This method might be thought of as similar to the older C/C++ 'Ord()' function, except it is for HTML.
- Parameters:
str - any String that contains an HTML Escape Sequence &#x[HEXADECIMAL VALUE];
- Returns:
- a String, with all of the hexadecimal escape sequences removed and replaced with their equivalent ASCII or UniCode Characters.
- See Also:
replaceAll_DEC(String str), StrReplace.r(String, String[], char[])
- Code:
- Exact Method Body:
// This is the RegEx Matcher from the top.  It matches string's that look like: &#x\d+;
Matcher m = HEX_CODE.matcher(str);

// Save the escape-string regex search matches in a TreeMap.  We need to use a
// TreeMap because it is much easier to check if a particular escape sequence has already
// been found.  It is easier to find duplicates with TreeMap's.
TreeMap<String, Character> escMap = new TreeMap<>();

while (m.find())
{
    // Use Base-16 Integer-Parse
    int i = Integer.valueOf(m.group(1), 16);

    // Do not un-escape EMOJI's...  It makes a mess - they are sequences of characters
    // not single characters.
    if (i > Character.MAX_VALUE) continue;

    // Retrieve the Text Information about the HTML Escape Sequence
    String text = m.group();

    // Check if it is a valid HTML 5 Escape Sequence.
    if (! escMap.containsKey(text)) escMap.put(text, Character.valueOf((char) i));
}

// Build the matchStr's and replaceChar's arrays.  These are just the KEY's and
// the VALUE's of the TreeMap<String, Character> which was just built.
// NOTE: A TreeMap is used *RATHER THAN* two parallel arrays in order to avoid keeping
// duplicates when the replacement occurs.
String[] matchStrs = escMap.keySet().toArray(new String[escMap.size()]);
char[] replaceChars = new char[escMap.size()];

// Lookup each "ReplaceChar" in the TreeMap, and put it in the output "replaceChars"
// array.  The class StrReplace will replace all the escape sequences with the actual
// characters.
for (int i=0; i < matchStrs.length; i++) replaceChars[i] = escMap.get(matchStrs[i]);

return StrReplace.r(str, matchStrs, replaceChars);
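For readers without the StrReplace helper at hand, the same idea can be sketched with the standard `java.util.regex` append-replacement loop. This is an illustration under stated assumptions, not the library's code: the pattern `HEX` below is a guess (the library's own HEX_CODE pattern is not shown on this page), and the names `HexUnescapeSketch` and `unescapeHex` are hypothetical. It assumes Java 9 or later for `appendReplacement(StringBuilder, String)`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HexUnescapeSketch {
    // Assumed pattern: "&#x" followed by 1 to 6 hex digits and a semi-colon.
    private static final Pattern HEX = Pattern.compile("&#[xX]([0-9a-fA-F]{1,6});");

    static String unescapeHex(String s) {
        Matcher m = HEX.matcher(s);
        StringBuilder out = new StringBuilder();

        while (m.find()) {
            int value = Integer.parseInt(m.group(1), 16);

            // Values that do not fit in one char (for example Emoji) are left escaped.
            String replacement = (value <= Character.MAX_VALUE)
                ? String.valueOf((char) value)
                : m.group();

            // quoteReplacement keeps '$' and '\' in the replacement literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }

        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescapeHex("&#x67;o"));            // prints go
        System.out.println(unescapeHex("smile: &#x1F600;"));   // prints smile: &#x1F600;
    }
}
```

A single scan like this replaces each match where it is found, so no duplicate-tracking map is needed; the TreeMap in the method body above serves the different, array-based interface of StrReplace.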
-
replaceAll_DEC
public static java.lang.String replaceAll_DEC(java.lang.String str)
This method functions the same as replaceAll_HEX(String), except it replaces only HTML Escape sequences that are represented using decimal (base-10) values. 'replaceAll_HEX(...)' works on hexadecimal (base-16) values.
Base-10 HTML Escape-Sequence Examples:
Substring from Input:    Web-Browser Converts To:
&#48;                    '0' in your browser
&#64;                    '@' in your browser
&#123;                   '{' in your browser
&#125;                   '}' in your browser
Base-10 & Base-16 Escape-Sequence Difference:
- &#x[hex base-16 value]; There is an 'x' as the third character in the String
- &#[decimal base-10 value]; There is no 'x' in the escape-sequence String!
This short example delineates the difference between an HTML escape-sequence that employs Base-10 numbers, and one using Base-16 (Hexadecimal) numbers.
- Parameters:
str - any String that contains the HTML Escape Sequence &#[DECIMAL VALUE];.
- Returns:
- a String, with all of the decimal escape sequences removed and replaced with ASCII or UniCode Characters.
If this parameter does not contain such a sequence, then this method will return the same input-String reference as its return value.
- See Also:
replaceAll_HEX(String str), StrReplace.r(String, String[], char[])
- Code:
- Exact Method Body:
// This is the RegEx Matcher from the top.  It matches string's that look like: &#\d+;
Matcher m = DEC_CODE.matcher(str);

// Save the escape-string regex search matches in a TreeMap.  We need to use a
// TreeMap because it is much easier to check if a particular escape sequence has already
// been found.  It is easier to find duplicates with TreeMap's.
TreeMap<String, Character> escMap = new TreeMap<>();

while (m.find())
{
    // Use Base-10 Integer-Parse
    int i = Integer.valueOf(m.group(1));

    // Do not un-escape EMOJI's...  It makes a mess - they are sequences of characters
    // not single characters.
    if (i > Character.MAX_VALUE) continue;

    // Retrieve the Text Information about the HTML Escape Sequence
    String text = m.group();

    // Check if it is a valid HTML 5 Escape Sequence.
    if (! escMap.containsKey(text)) escMap.put(text, Character.valueOf((char) i));
}

// Build the matchStr's and replaceChar's arrays.  These are just the KEY's and
// the VALUE's of the TreeMap<String, Character> which was just built.
// NOTE: A TreeMap is used *RATHER THAN* two parallel arrays in order to avoid keeping
// duplicates when the replacement occurs.
String[] matchStrs = escMap.keySet().toArray(new String[escMap.size()]);
char[] replaceChars = new char[escMap.size()];

// Lookup each "ReplaceChar" in the TreeMap, and put it in the output "replaceChars"
// array.  The class StrReplace will replace all the escape sequences with the actual
// characters.
for (int i=0; i < matchStrs.length; i++) replaceChars[i] = escMap.get(matchStrs[i]);

return StrReplace.r(str, matchStrs, replaceChars);
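The base-10 case can be illustrated compactly with the functional form of `Matcher.replaceAll`, available since Java 9. As before, this is only a sketch: the pattern `DEC` is an assumption (the library's DEC_CODE pattern is not shown here), and `DecUnescapeSketch` and `unescapeDec` are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DecUnescapeSketch {
    // Assumed pattern: "&#" followed by 1 to 7 decimal digits and a semi-colon.
    // There is no 'x' after the '#', so hexadecimal sequences never match.
    private static final Pattern DEC = Pattern.compile("&#([0-9]{1,7});");

    static String unescapeDec(String s) {
        return DEC.matcher(s).replaceAll(match -> {
            int value = Integer.parseInt(match.group(1));

            // Leave anything that does not fit into a single 16-bit char untouched.
            String replacement = (value <= Character.MAX_VALUE)
                ? String.valueOf((char) value)
                : match.group();

            return Matcher.quoteReplacement(replacement);
        });
    }

    public static void main(String[] args) {
        System.out.println(unescapeDec("&#123;x&#125;"));   // prints {x}
        System.out.println(unescapeDec("&#x41;"));          // prints &#x41; (hex is ignored)
    }
}
```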
-
-
replaceAll_TEXT
public static java.lang.String replaceAll_TEXT(java.lang.String str)
Replaces all text-word (named) HTML Escape Sequences with the characters they represent.
Standard (Text) HTML Escape-Sequence Examples:
ASCII or UNICODE:      Can be Escaped Using:
" (double-quote)       &quot; (in HTML)
& (ampersand)          &amp; (in HTML)
< (less-than)          &lt; (in HTML)
> (greater-than)       &gt; (in HTML)
View Escape-Codes:
The list included within the page attached (below) is a complete list of all Text-String HTML Escape Sequences known to this class. This list does not include any Code Point, Hex or Decimal Number sequences.
All HTML Escape Sequences
- Parameters:
str - any String that contains HTML Escape Sequences that need to be converted to their ASCII-UniCode character representations.
- Returns:
- a String, with all of the text escape sequences removed and replaced with ASCII or UniCode Characters.
- Throws:
java.lang.IllegalStateException
- See Also:
replaceAll_HEX(String str), StrReplace.r(String, boolean, String[], Torello.Java.Function.ToCharIntTFunc)
- Code:
- Exact Method Body:
// We only need to find which escape sequences are in this string.
// use a TreeSet<String> to list them.  It will
Matcher m = TEXT_CODE.matcher(str);
TreeMap<String, String> escMap = new TreeMap<>();

while (m.find())
{
    // Retrieve the Text Information about the HTML Escape Sequence
    String text = m.group();
    String sequence = text.substring(1, text.length() - 1);

    // Check if it is a valid HTML 5 Escape Sequence.
    if ((! escMap.containsKey(text)) && htmlEscChars.containsKey(sequence))
        escMap.put(text, sequence);
}

// Convert the TreeSet to a String[] array... and use StrReplace
String[] escArr = new String[escMap.size()];

return StrReplace.r(
    str, false, escMap.keySet().toArray(escArr),
    (int i, String sequence) -> htmlEscChars.get(escMap.get(sequence))
);
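The named-sequence lookup can be sketched with a small map and a regex. This is a hedged illustration only: the four-entry `NAMED` table stands in for the much larger table this class loads, the pattern `TEXT` is an assumption about what a named sequence looks like, and `NamedUnescapeSketch` and `unescapeNamed` are hypothetical names. Java 9 or later is assumed.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedUnescapeSketch {
    // Hypothetical, deliberately tiny lookup table (an assumption for this sketch).
    private static final Map<String, Character> NAMED =
        Map.of("amp", '&', "lt", '<', "gt", '>', "quot", '"');

    // Assumed shape of a named sequence: '&', a letter, letters or digits, then ';'
    private static final Pattern TEXT = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    static String unescapeNamed(String s) {
        return TEXT.matcher(s).replaceAll(match -> {
            Character c = NAMED.get(match.group(1));

            // Unknown names are left exactly as they were found.
            String replacement = (c != null) ? c.toString() : match.group();
            return Matcher.quoteReplacement(replacement);
        });
    }

    public static void main(String[] args) {
        // prints: a < b && c &bogus;
        System.out.println(unescapeNamed("a &lt; b &amp;&amp; c &bogus;"));
    }
}
```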
-
replaceAll
@Deprecated public static java.lang.String replaceAll(java.lang.String s)
Deprecated.
Calls all of the HTML Escape Sequence convert/replace String functions at once.
- Parameters:
s - This may be any Java String which may (or may not) contain HTML Escape sequences.
- Returns:
- a new String where all HTML escape-sequence substrings have been replaced with their natural character representations.
- See Also:
replaceAll_DEC(String), replaceAll_HEX(String), replaceAll_TEXT(String)
- Code:
- Exact Method Body:
return replaceAll_HEX(replaceAll_DEC(replaceAll_TEXT(s)));
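One property of chaining separate passes is worth seeing in isolation: the output of an earlier pass becomes input to a later one, so text that was escaped twice can be un-escaped twice. The sketch below demonstrates this with simplified stand-in passes; it does not call the library and makes no claim about why the method above is deprecated. All names (`ChainedPassesSketch`, `textPass`, `decPass`, `singlePass`) and both regular expressions are assumptions for illustration, and Java 9 or later is assumed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ChainedPassesSketch {
    private static final Pattern DEC = Pattern.compile("&#([0-9]{1,5});");
    private static final Pattern BOTH = Pattern.compile("&(?:amp|#([0-9]{1,5}));");

    // Decode base-10 digits, leaving values that do not fit in a char escaped.
    private static String decodeDec(String digits, String original) {
        int value = Integer.parseInt(digits);
        return (value <= Character.MAX_VALUE) ? String.valueOf((char) value) : original;
    }

    // Simplified stand-in for a "text" pass: only "&amp;" is handled.
    static String textPass(String s) { return s.replace("&amp;", "&"); }

    static String decPass(String s) {
        return DEC.matcher(s).replaceAll(
            m -> Matcher.quoteReplacement(decodeDec(m.group(1), m.group())));
    }

    // Two passes: the second pass sees the output of the first.
    static String chained(String s) { return decPass(textPass(s)); }

    // One pass: every match is decoded exactly once, results are never re-scanned.
    static String singlePass(String s) {
        return BOTH.matcher(s).replaceAll(m -> {
            String digits = m.group(1);   // null when the match was "&amp;"
            String r = (digits == null) ? "&" : decodeDec(digits, m.group());
            return Matcher.quoteReplacement(r);
        });
    }

    public static void main(String[] args) {
        System.out.println(chained("&amp;#65;"));      // prints A
        System.out.println(singlePass("&amp;#65;"));   // prints &#65;
    }
}
```

Whether the double decoding is desirable depends on the input; for HTML that displays the literal text `&#65;` to a reader, the single-pass result is the faithful one.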
-
replace
public static java.lang.String replace(java.lang.String s)
This is an optimized HTML String-replacement method. It will substitute all HTML Escape Sequences with the actual characters they represent.
Emoji's:
In keeping with the other methods in this class, if there are any HTML Emoji Escape Sequences, these shall not be replaced. Emoji's work on the principle of Code-Point, and though replacing such escape sequences is not difficult, because they work in the Code-Point space, their substitutions are never single character representations (a Code Point above Character.MAX_VALUE always requires two Java char's, a surrogate pair).
There is an alternate method that can substitute the actual Java char's for a Code-Point Escape-Sequence.
Code-Point:
For those familiar with Code Point, the way this method works is that it just skips any escaped sequence that uses a Base-10 or Base-16 Representation if the number inside the Escape-Sequence is larger than Character.MAX_VALUE.
It is important to remember that all Java String's are simply char-Arrays which are wrapped in a java.lang.String class instance. Since the Primitive Type 'char' is fundamentally a 16-bit character, no character can be converted if its numeric value is larger than Character.MAX_VALUE (65,535). Although Code Point works just fine in Java, it is left as a separate method in this class.
Rendering Emoji's:
Many standard web-pages use very little of the more advanced Escape-Sequences. Emoji's are somewhat popular. The issue isn't about whether the 'Code Point' based Escape-Sequences can be converted or handled, but rather it is about whether or not you really want to leave the comfortable world of HTML Escape-Sequences for your Code Point related characters.
Once a Code Point sequence has been un-escaped, it will only be visible in text-editors / viewers that are capable of rendering Code Point's or Emoji's (and not all text editors can do this!)
- Parameters:
s - This may be any Java String which may (or may not) contain HTML Escape sequences.
- Returns:
- a new String where all HTML escape-sequence substrings have been replaced with their natural character representations.
- Code:
- Exact Method Body:
return EscapeRepl.replace(s);
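The 16-bit limit discussed above can be checked directly with `java.lang.Character`. The short sketch below uses only standard-library calls; the class and method names (`CodePointWidthSketch`, `fitsInChar`, `asString`) are hypothetical and exist only for this illustration.

```java
class CodePointWidthSketch {
    /** True when the value can be stored in a single 16-bit Java char. */
    static boolean fitsInChar(int value) {
        return value >= 0 && value <= Character.MAX_VALUE;
    }

    /** Builds the String for one code point; it holds one or two chars. */
    static String asString(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        int snowman = 0x2603;    // inside the Basic Multi-Lingual Plane
        int grinning = 0x1F600;  // an Emoji, outside the Basic Multi-Lingual Plane

        System.out.println(Character.charCount(snowman));    // prints 1
        System.out.println(Character.charCount(grinning));   // prints 2

        String s = asString(grinning);
        System.out.println(s.length());                       // prints 2 (two chars)
        System.out.println(s.codePointCount(0, s.length()));  // prints 1 (one code point)
    }
}
```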
-
escChar
public static java.lang.String escChar(char c, boolean use16BitEscapeSequence)
This method shall simply escape any char into an HTML Escape String.
Input 'char'                  Returned String's
'中' (Middle / China)         "&#20013;" (Base 10), "&#x4E2D;" (Base 16)
'日' (Japan / Sun)            "&#26085;" (Base 10), "&#x65E5;" (Base 16)
'Ñ' (Spanish Tilde)           "&#209;" (Base 10), "&#xD1;" (Base 16)
'ñ' (Lower-Case Tilde)        "&#241;" (Base 10), "&#xF1;" (Base 16)
'☃' (Snowman Glyph)           "&#9731;" (Base 10), "&#x2603;" (Base 16)
Java 'char' Primitive-Type:
The Java primitive 'char' type, which, again, is a 16-bit type (2^16 = 65,536 values, 0 through 65,535), essentially equates to the primary plane (plane 0) of the 17 UNICODE planes. This is also known as the Basic Multi-Lingual Plane.
Here, nearly any foreign language character needed by a programmer (including the commonly used Chinese Character Glyphs) is easily found with a bit of searching. Any modern web-browser can display these characters, if they are escaped using the HTML Escape Sequences returned by this method.
Modern-Browsers & UTF-8:
As an aside, if a programmer includes the HTML Element: <META CHARSET="utf-8"> in the <HEAD>...</HEAD> portion of an HTML Page, it becomes easy to include such characters (from the Multi-Lingual Plane) without even needing to use Escape-Sequences for the characters.
Any Web-Browser which knows before-hand that non-ASCII characters (higher than character #127 / 0x7F) are being transmitted, will interpret them using UTF-8. In this case escaping the char's becomes unnecessary.
- Parameters:
c - Any Java Character. Note that the Java Primitive Type 'char' is a 16-bit type. This parameter equates to the UNICODE Characters 0x0000 up to 0xFFFF.
use16BitEscapeSequence - If the user would like the returned, escaped, String to use Base 16 for the escaped digits, pass TRUE to this parameter. If the user would like to retrieve an escaped String that uses standard Base 10 digits, then pass FALSE to this parameter.
- Returns:
- The passed character parameter 'c' will be converted to an HTML Escape Sequence. For instance if the character '我', which is the Chinese Character for I, Me, Myself were passed to this method, then the String "&#25105;" would be returned.
If the parameter 'use16BitEscapeSequence' had been passed TRUE, then this method would, instead, return the String "&#x6211;".
- Code:
- Exact Method Body:
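// NOTE: as written, TRUE selects the Base-10 form and FALSE selects the Base-16 form,
// which is the reverse of the parameter description.
return use16BitEscapeSequence
    ? "&#" + ((int) c) + ";"
    : "&#x" + Integer.toHexString((int) c).toUpperCase() + ";";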
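The two output forms described in the parameter documentation can be reproduced with a few lines of standard Java. This is a standalone sketch of the documented behavior (TRUE yields Base 16, FALSE yields Base 10), not a copy of the method body; `CharEscapeSketch`, `escape`, and the parameter name `hex` are hypothetical. Unicode escapes are used in the source so the example does not depend on the file's encoding.

```java
class CharEscapeSketch {
    /** Escapes one char as "&#x...;" (hex == true) or "&#...;" (hex == false). */
    static String escape(char c, boolean hex) {
        return hex
            ? "&#x" + Integer.toHexString(c).toUpperCase() + ";"   // base-16 digits
            : "&#" + (int) c + ";";                                // base-10 digits
    }

    public static void main(String[] args) {
        char middle = '\u4E2D';   // the character 中

        System.out.println(escape(middle, false));    // prints &#20013;
        System.out.println(escape(middle, true));     // prints &#x4E2D;
        System.out.println(escape('\u2603', true));   // prints &#x2603; (snowman)
    }
}
```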
-
escCodePoint
public static java.lang.String escCodePoint(int codePoint, boolean use16BitEscapeSequence)
This method shall simply escape any Code Point integer into an HTML Escape String. Below is a list of a few examples of Code Points commonly used. As stated, the Basic Multi Lingual Plane, which is Plane 0 of the UNICODE Space, fits into the 16-bit Java Primitive Type 'char'. For such situations, "Code Points" have very little application to software. Essentially, Java's 16-bit 'char' primitive type gives that to the programmer "for free", without needing to think past, again, Java's primitive-type 'char'.
Although "Code Points" were developed decades ago, today, one of the most common uses for them is the Emoji's being used on numerous web-sites. It is important to note that not all Emoji's will fit into a single Code Point, and, as such, equating a "Code Point" with an "Emoji" is actually incorrect. However, for the more complicated Emoji's available, all that is really going on is that sequences of code points are being sent and interpreted by the web-browser as a single glyph or character-image.
Escaping Emoji's:
Just as with Foreign-Language characters, the code-points themselves (without having been escaped) can be included directly into a text file, as long as the HTML-File indicates that non-ASCII, or UTF-8 data is being transmitted. In such cases, to avoid using these Escape-Sequences at all, just include the usual Java char's in the file, and the meta tag in the HTML <HEAD>...</HEAD> section, as follows:
HTML-Tag to Include: <META CHARSET="utf-8">.
And here is a (very) brief sample table of Emoji's and their HTML Escape-Sequences:
Input Code Point (int)                Returned String's
😀 (Grinning Face) (128512)           "&#128512;" (Base 10), "&#x1F600;" (Base 16)
👍 (Thumb's Up) (128077)              "&#128077;" (Base 10), "&#x1F44D;" (Base 16)
🌮 (Taco) (127790)                    "&#127790;" (Base 10), "&#x1F32E;" (Base 16)
'A' (Upper-Case A) (ASCII# 65)        "&#65;" (Base 10), "&#x41;" (Base 16)
'0' (Number Zero) (ASCII# 48)         "&#48;" (Base 10), "&#x30;" (Base 16)
'中' (Middle-China) (20013)           "&#20013;" (Base 10), "&#x4E2D;" (Base 16)
'ü' (German Umlaut) (252)             "&#252;" (Base 10), "&#xFC;" (Base 16)
'Ñ' (Spanish Tilde) (209)             "&#209;" (Base 10), "&#xD1;" (Base 16)
Again, if the '.html' files you are providing to a web-browser indicate the <META CHARSET="utf-8">, it is not necessary to provide HTML escape sequences for an Emoji, or any 'Code Point' at all. Instead, if the text-editor you are using to edit your '.html' files can handle code points, they may be included directly into the 'html' file itself.
Multi-Code-Point Emoji's:
There are numerous Emoji's that are represented by sequences of code-points, AND NOT just a single code point integer. In such cases, providing HTML escape sequences will actually prevent the browser from rendering the "conglomerate" Emoji.
The Emoji's below should not be escaped, because they are sequences of code points, rather than just single code points. Instead, their code points must be included directly into the '.html' file itself, or they will not be properly rendered by the web-browser...
Emoji                                        Code Point Sequence
👁️🗨️ "Eye in Speech"                          U+1F441 U+200D U+1F5E8 ==> 👁 (Eye - U+1F441) + GLUE (U+200D) + 🗨 (Speech Bubble - U+1F5E8)
👉🏿 "Index-Finger Pointing, Dark Hand"       U+1F449 U+1F3FF ==> 👉 (Index Finger Pointing - U+1F449) + Dark Skin Color - U+1F3FF
- Parameters:
codePoint - This will take any integer. It will be interpreted as a UNICODE code point. Java uses 16-bit values for its primitive 'char' type. This is also the "first plane" of the UNICODE Space and actually referred to as the Basic Multi Lingual Plane. Any value passed to this method that is no greater than 65,535 would receive the same escape-String that it would from a call to the method escChar(char, boolean).
use16BitEscapeSequence - If the user would like the returned, escaped, String to use Base 16 for the escaped digits, pass TRUE to this parameter. If the user would like to retrieve an escaped String that uses standard Base 10 digits, then pass FALSE to this parameter.
- Returns:
- The code point will be converted to an HTML Escape Sequence, as a java.lang.String. For instance, if the code point for "the snowman" glyph (character ☃), which happens to be represented by a code point that is below 65,535 (and, incidentally, does "fit" into a single Java 'char'), were passed, this method would return the String "&#9731;".
If the parameter 'use16BitEscapeSequence' had been passed TRUE, then this method would, instead, return the String "&#x2603;".
- Throws:
java.lang.IllegalArgumentException - Java has a method for determining whether any integer is a valid code point. Not all of the integers "fit" into the 17 Unicode "planes". Note that each of the planes in 'Unicode Space' contains 65,536 (or 2^16) characters.
- Code:
- Exact Method Body:
if (! Character.isValidCodePoint(codePoint)) throw new IllegalArgumentException(
    "The integer you have passed to this method [" + codePoint + "] was deemed an " +
    "invalid Code Point after a call to: [java.lang.Character.isValidCodePoint(int)].  " +
    "Therefore this method is unable to provide an HTML Escape Sequence."
);

// NOTE: as written, TRUE selects the Base-10 form and FALSE selects the Base-16 form,
// which is the reverse of the parameter description.
return use16BitEscapeSequence
    ? "&#" + codePoint + ";"
    : "&#x" + Integer.toHexString(codePoint).toUpperCase() + ";";
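To show how code-point escaping applies to a whole String, the sketch below walks a String by code point (not by char) and escapes everything outside ASCII. It follows the documented parameter meaning (TRUE yields Base 16) and uses only standard-library calls; `CodePointEscapeSketch`, `escapeCodePoint`, `escapeNonAscii`, and the parameter name `hex` are hypothetical names introduced for this illustration.

```java
class CodePointEscapeSketch {
    /** Escapes one code point; rejects integers outside the Unicode range. */
    static String escapeCodePoint(int codePoint, boolean hex) {
        if (!Character.isValidCodePoint(codePoint))
            throw new IllegalArgumentException("Not a valid code point: " + codePoint);

        return hex
            ? "&#x" + Integer.toHexString(codePoint).toUpperCase() + ";"
            : "&#" + codePoint + ";";
    }

    /** Escapes every non-ASCII code point in a String, leaving ASCII untouched. */
    static String escapeNonAscii(String s, boolean hex) {
        StringBuilder out = new StringBuilder();

        // codePoints() joins surrogate pairs, so an Emoji arrives as one int.
        s.codePoints().forEach(cp -> {
            if (cp < 128) out.append((char) cp);
            else out.append(escapeCodePoint(cp, hex));
        });

        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeCodePoint(128512, false));   // prints &#128512;
        System.out.println(escapeCodePoint(128512, true));    // prints &#x1F600;

        // "\uD83D\uDE00" is the surrogate pair for U+1F600 (Grinning Face)
        System.out.println(escapeNonAscii("hi \uD83D\uDE00", true));   // prints hi &#x1F600;
    }
}
```

Iterating with `codePoints()` rather than `toCharArray()` is the design choice that matters here: a char-by-char loop would escape each surrogate half separately and produce two invalid sequences.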
-
hasHTMLEsc
public static boolean hasHTMLEsc(char c)
Check the internal Escape Sequence Lookup Table. If there is an escape sequence String associated with the char provided to this method, then return TRUE. If there is no such Escape Sequence in the Lookup Table associated with parameter 'c', then return FALSE.
The Lookup Table can identify whether char parameter 'c' has an associated HTML Escape Sequence, or not. Escape sequences are always short, text-String's that were selected by the W3C (long ago, in the 1990's).
Please review the brief sample table below:
Input Character:             Method Return Value:
'&' (ampersand)              TRUE
'A' (letter-A)               FALSE
'<' (less-than-symbol)       TRUE
'9' (number-9)               FALSE
'>' (greater-than-symbol)    TRUE
View Escape-Codes:
The list included within the page attached (below) is a complete list of all Text-String HTML Escape-Sequences that are known to this class. This list does not include any Code-Point, Hex or Decimal-Number sequences.
All HTML Escape Sequences
- Parameters:
c - Any ASCII or UNICODE Character
- Returns:
TRUE if there is a String escape sequence for this character, and FALSE otherwise.
- See Also:
htmlEsc(char)
- Code:
- Exact Method Body:
return htmlEscSeq.get(Character.valueOf(c)) != null;
-
htmlEsc
public static java.lang.String htmlEsc(char c)
Check the internal Escape Sequence Lookup Table. If there is an escape sequence String associated with the char provided to this method, then return it.
For Instance:
Input Character:             Method Return Value:
'&'                          "amp"
'A' (letter-A)               null
'<' (less-than-symbol)       "lt"
'9' (number-9)               null
'>' (greater-than-symbol)    "gt"
View Escape Codes:
The list included within the page attached (below) is a complete list of all Text-String HTML Escape-Sequences that are known to this class. This list does not include any Code-Point, Hex or Decimal-Number sequences.
All HTML Escape Sequences
- Parameters:
c - Any ASCII or UNICODE Character
- Returns:
- The String that is used by web-browsers to escape this ASCII / Uni-Code character, if there is one saved in the internal Lookup Table. If the character provided does not have an associated HTML Escape String, then 'null' is returned. The entire escape-String is not provided, just the inner-characters. The leading '&' (Ampersand) and the trailing ';' (Semi-Colon) are not appended to the returned String.
- See Also:
hasHTMLEsc(char)
- Code:
- Exact Method Body:
return htmlEscSeq.get(Character.valueOf(c));
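Because the returned String omits the leading '&' and trailing ';', a caller has to add them back when building escaped output. The sketch below shows that step with a small char-to-name map; it is an assumption-laden illustration rather than the library's table: the four-entry `NAMES` map and the names `ReverseLookupSketch`, `hasName`, `name`, and `escapeText` are all hypothetical. Java 9 or later is assumed for `Map.of`.

```java
import java.util.Map;

class ReverseLookupSketch {
    // Hypothetical, deliberately tiny reverse table (an assumption for this sketch).
    private static final Map<Character, String> NAMES =
        Map.of('&', "amp", '<', "lt", '>', "gt", '"', "quot");

    static boolean hasName(char c) { return NAMES.containsKey(c); }

    /** Returns the inner name only (no '&' or ';'), or null when there is none. */
    static String name(char c) { return NAMES.get(c); }

    /** Escapes every char that has a name, re-adding the '&' and ';' delimiters. */
    static String escapeText(String s) {
        StringBuilder out = new StringBuilder();

        for (char c : s.toCharArray()) {
            String n = NAMES.get(c);
            if (n != null) out.append('&').append(n).append(';');
            else out.append(c);
        }

        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(hasName('&'));            // prints true
        System.out.println(name('9'));               // prints null
        System.out.println(escapeText("a<b & c"));   // prints a&lt;b &amp; c
    }
}
```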
-
-