Package Torello.HTML
Class Balance
- java.lang.Object
-
- Torello.HTML.Balance
-
public class Balance extends java.lang.Object
Utilities for checking that opening and closing TagNode elements match up (that the HTML is balanced).
This class inspects one particular aspect of HTML validity: properly balanced Opening & Closing Tags. Keep in mind that these sorts of checks are heuristics, not proofs of validity. For instance, most Web-Browsers (as of the writing of this Java-Doc Explanation) do not even require that all HTML-Tags actually be closed. Even a page with an opened-but-not-closed <DIV> will often display just fine in Google Chrome.
There are plenty of pages that use opening <LI> tags inside an Ordered-List (an <OL>) without ever closing them. This is also sometimes acceptable practice for <TR> and even <TD> tags.
Balance Heuristic:
All of the methods below generate an "Open and Closed Count" per page, for each HTML-Tag that the user has requested be counted. As an example, an HTML-Page, or page sub-section, containing 5 opening <DIV> tags, but only 4 closing </DIV> tags would have a Tag-Balance count of +1. This may seem somewhat trivial at first. However, given how complicated it can be to hunt down "bugs" in HTML, tools that find such mistakes can be invaluable.
Balance Tag-Count Examples:
A Tag-Balance Count of '0' means that, on the page provided, there is exactly one closing-tag for each and every opening-tag. A few sample return values are provided in the table below. Remember, the primary balance-checking method in this class computes a count for each and every HTML-Tag present on the page. Usually tags with a '0' count are removed from the result, and only non-zero tags are mentioned in the returned Hashtable.
Here are a few sample counts for some common HTML-Tags:
Tag-Balance Count    Meaning
TD: -1               There is an "extra" closing Table-Data Cell on the page. Specifically there is one more </TD> tag than needed.
DIV: 0               The number of <DIV> Tags is precisely equal to the number of </DIV> tags on the page provided.
B: +1                Somewhere on the page there is an opening Bold-Tag (a <B> Tag), that isn't actually closed by a </B> anywhere.
TR: +2               There are two Table-Rows whose opening <TR> tags aren't closed.
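The counting rule behind the table above can be sketched in a few lines. This is only an illustration under stated assumptions: the nested Tag class is a hypothetical stand-in for the library's TagNode (it carries just a tag name and an open/close flag), and the class name BalanceCountSketch is invented for this example.

```java
import java.util.Hashtable;
import java.util.List;

final class BalanceCountSketch
{
    // Hypothetical stand-in for TagNode: a tag name plus an open/close flag.
    static final class Tag
    {
        final String tok;
        final boolean isClosing;

        Tag(String tok, boolean isClosing) { this.tok = tok; this.isClosing = isClosing; }
    }

    static Tag open(String tok)  { return new Tag(tok, false); }
    static Tag close(String tok) { return new Tag(tok, true);  }

    // Each opening tag adds 1 to its tag's count, each closing tag subtracts 1.
    static Hashtable<String, Integer> count(List<Tag> tags)
    {
        final Hashtable<String, Integer> ht = new Hashtable<>();
        for (final Tag t : tags) ht.merge(t.tok, t.isClosing ? -1 : 1, Integer::sum);
        return ht;
    }

    public static void main(String[] args)
    {
        final List<Tag> page = List.of(
            open("td"), close("td"), close("td"),   // one extra </TD>
            open("div"), close("div"),              // balanced
            open("b"),                              // never closed
            open("tr"), open("tr")                  // two unclosed rows
        );

        final Hashtable<String, Integer> ht = count(page);

        System.out.println("td:  " + ht.get("td"));    // td:  -1
        System.out.println("div: " + ht.get("div"));   // div: 0
        System.out.println("b:   " + ht.get("b"));     // b:   1
        System.out.println("tr:  " + ht.get("tr"));    // tr:  2
    }
}
```

A count only says how many tags are unmatched, not where they are.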
Depth Heuristic:
The 'depth' methods use a different approach to investigating HTML-Tag validity. The 'depth' of an HTML-Tag is simply the level of nesting for any particular tag. The Maximum-Depth of an HTML-Tag is the largest number of nested tags of that kind that are open at once anywhere on a page. A page whose <DIV> tag has a Maximum-Depth of +3 is a page where (in at least one location) there are three levels of nested dividers.
The Minimum-Depth of a tag for any given page should usually be zero. A page that contains a tag with a negative depth is considered invalid. A negative depth means that at some point a Closing-Tag was placed without any preceding Opening-Tag to match it.
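The depth idea can be sketched as a running counter that remembers its lowest and highest values. The code below is a minimal illustration, not the library's implementation: the Tag class is a hypothetical stand-in for TagNode and the class name DepthSketch is invented here. The second sample shows a page whose count is balanced but whose minimum depth is negative.

```java
import java.util.Arrays;
import java.util.List;

final class DepthSketch
{
    // Hypothetical stand-in for TagNode: a tag name plus an open/close flag.
    static final class Tag
    {
        final String tok;
        final boolean isClosing;

        Tag(String tok, boolean isClosing) { this.tok = tok; this.isClosing = isClosing; }
    }

    // Returns { minimum depth, maximum depth, final count } for one tag name.
    static int[] depth(List<Tag> tags, String tok)
    {
        int cur = 0, min = 0, max = 0;

        for (final Tag t : tags)
        {
            if (! t.tok.equals(tok)) continue;

            cur += t.isClosing ? -1 : 1;   // opening: +1, closing: -1
            if (cur < min) min = cur;
            if (cur > max) max = cur;
        }

        return new int[] { min, max, cur };
    }

    public static void main(String[] args)
    {
        // Three nested dividers, all closed
        final List<Tag> nested = List.of(
            new Tag("div", false), new Tag("div", false), new Tag("div", false),
            new Tag("div", true),  new Tag("div", true),  new Tag("div", true)
        );

        // A closing tag that appears before any opening tag
        final List<Tag> strayClose = List.of(new Tag("div", true), new Tag("div", false));

        System.out.println(Arrays.toString(depth(nested, "div")));       // [0, 3, 0]
        System.out.println(Arrays.toString(depth(strayClose, "div")));   // [-1, 0, 0]
    }
}
```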
Hi-Lited Source-Code: Torello/HTML/Balance.java
File Size: 27,881 Bytes
Line Count: 701 '\n' Characters Found
Stateless Class: This class neither contains any program-state, nor can it be instantiated. The @StaticFunctional Annotation may also be called 'The Spaghetti Report'. Static-Functional classes are, essentially, C-Styled Files, without any constructors or non-static member fields. It is a concept very similar to the Java-Bean's @Stateless Annotation.
- 1 Constructor(s), 1 declared private, zero-argument constructor
- 16 Method(s), 16 declared static
- 0 Field(s)
-
-
Method Summary
Balanced Open-Closed HTML Tag Checks
Modifier and Type                   Method
static Hashtable<String,Integer>    check(Vector<? super TagNode> html)
static int[]                        check(Vector<? super TagNode> html, String... htmlTags)
static Hashtable<String,Integer>    checkNonZero(Hashtable<String,Integer> ht)
static int                          checkTag(Vector<? super TagNode> html, String htmlTag)
static int[]                        nonNestedCheck(Vector<? super TagNode> html, String htmlTag)

Balanced Open-Closed Tag Check & Print to String
Modifier and Type                   Method
static String                       CB(Vector<HTMLNode> html)

Nested HTML Tag Checks
Modifier and Type                   Method
static Hashtable<String,int[]>      depth(Vector<? super TagNode> html)
static Hashtable<String,int[]>      depth(Vector<? super TagNode> html, String... htmlTags)
static Hashtable<String,int[]>      depthGreaterThanOne(Hashtable<String,int[]> ht)
static Hashtable<String,int[]>      depthInvalid(Hashtable<String,int[]> ht)
static int[]                        depthTag(Vector<? super TagNode> html, String htmlTag)
static Ret2<int[],int[]>            locationsAndDepth(Vector<? super TagNode> html, String htmlTag)

Printing Tag-Check Hashtables
Modifier and Type                   Method
static String                       toStringBalance(int[] balanceCheckReport, String... htmlTags)
static String                       toStringBalance(Hashtable<String,Integer> balanceCheckReport)
static String                       toStringDepth(Hashtable<String,int[]> depthReport)
-
-
-
Method Detail
-
CB
public static java.lang.String CB(java.util.Vector<HTMLNode> html)
Invokes:
Example:
    String b = Balance.CB(a.articleBody);
    System.out.println((b == null) ? "Page has Balanced HTML" : b);

    // If Page has equal number of open and close tags prints:
    // Page has Balanced HTML
    // OTHERWISE PRINTS REPORT
- Parameters:
html - This may be any Vectorized-HTML Web-Page (or sub-page).
The Variable-Type Wild-Card Expression '? super TagNode' means that both a Vector<TagNode> and a Vector<HTMLNode> are accepted by this parameter. Neither will cause an exception to be thrown.
Note that if a Vector<Object> is passed, and there are no instances of class TagNode contained by that Vector, then this method will simply exit gracefully.
- Returns:
Will return null if the snippet or page has 'balanced' HTML, otherwise returns the trimmed balance-report as a String.
- Code:
- Exact Method Body:
    final String ret = toStringBalance(checkNonZero(check(html)));
    return (ret.length() == 0) ? null : ret;
-
check
public static java.util.Hashtable<java.lang.String,java.lang.Integer> check (java.util.Vector<? super TagNode> html)
Creates a Hashtable that has a count of all open and closed HTML tags found on the page.
This Hashtable may be regarded as maintaining "counts" on each-and-every HTML tag to identify whether there is a one-to-one balance mapping between opening and closing tags for each element. When the count generated by this method is non-zero (for a particular HTML-Tag) it means that there are an unequal number of opening and closing elements for that tag.
Suppose this method were to produce a Hashtable, and that Hashtable were queried for a count on the HTML <DIV> tag (dividers). If that count turned out to be a non-zero positive number it would mean that the Vectorized-HTML had more opening <DIV> tags than closing </DIV> tags on that page.
Browser Validity:
There are some browser-parse advocates who may state that not all HTML Tags have to be closed. For instance, there are plenty of pages out there that won't always use a '</LI>' Tag for elements of an Ordered or Un-Ordered List.
These types of subtle nuances hint at the commonly-heard phrase "Browser War," and the concept of validity, therefore, is not addressed in this class.
The following example will help explain the use of this method. If an HTML page needs to be checked to see that all elements are properly opened and closed, this method can be used to return a list of any HTML element tag that does not have an equal number of opening and closing tags.
In this example, the generated Java-Doc HTML-Page for class TagNode is checked.
Example:
    String                      html = FileRW.loadFileToString(htmlFileName);
    Vector<HTMLNode>            v    = HTMLPage.getPageTokens(html, false);
    Hashtable<String, Integer>  b    = Balance.check(v);
    StringBuffer                sb   = new StringBuffer();

    // This part just prints a text-output to a string buffer, which is printed to the screen.
    for (String key : b.keySet())
    {
        Integer i = b.get(key);

        // Only print keys that had a "non-zero count"
        // A Non-Zero-Count implies Opening-Tag-Count and Closing-Tag-Count are not equal!
        if (i.intValue() != 0) sb.append(key + "\t" + i.intValue() + "\n");
    }

    System.out.println(sb.toString());

    // This example output was: "i -1".  Since closing tags subtract 1 from the count, this
    // means there is one more closing </I> tag than there are opening <I> tags.
    // NOTE: To find where this unmatched element is, use method: nonNestedCheck(Vector, String)
- Parameters:
html - This may be any Vectorized-HTML Web-Page (or sub-page).
The Variable-Type Wild-Card Expression '? super TagNode' means that both a Vector<TagNode> and a Vector<HTMLNode> are accepted by this parameter. Neither will cause an exception to be thrown.
Note that if a Vector<Object> is passed, and there are no instances of class TagNode contained by that Vector, then this method will simply exit gracefully.
- Returns:
A Hashtable map of the count of each HTML-Tag present in the input Vector.
For instance, if this Vector had five <A HREF=...> (Anchor-Link) tags, and six </A> tags, then the returned Hashtable would have a String-key equal to "A" with an integer value of -1.
- See Also:
FileRW.loadFileToString(String), HTMLPage.getPageTokens(CharSequence, boolean)
- Code:
- Exact Method Body:
    final Hashtable<String, Integer> ht = new Hashtable<>();

    // Iterate through the HTML List, we are only counting HTML Elements, not text or comments
    for (final Object o : html) if (o instanceof TagNode)
    {
        final TagNode tn = (TagNode) o;

        // Singleton tags are also known as 'self-closing' tags.  BR, HR, IMG, etc...
        if (HTMLTags.isSingleton(tn.tok)) continue;

        // Current value in the table, or 'null' if this tag hasn't been seen yet.
        //
        // An opening-version (TC.OpeningTags, For Instance <DIV ...>) will ADD 1 to the count
        // A closing-tag (For Instance: </DIV>) will SUBTRACT 1 from the count

        final Integer I = ht.get(tn.tok);

        final int updated =
            ((I == null) ? 0 : I) +     // Convert 'null' to Zero; otherwise no-change
            (tn.isClosing ? -1 : 1);    // ClosingTags => -1, OpeningTags => +1

        // Update the return result Hashtable for this particular HTML-Element (tn.tok)
        ht.put(tn.tok, updated);
    }

    return ht;
-
check
public static int[] check(java.util.Vector<? super TagNode> html, java.lang.String... htmlTags)
Creates an array that includes an open-and-close 'count' for each HTML-Tag that was requested via the passed input String[]-Array parameter 'htmlTags'.
Browser Validity:
There are some browser-parse advocates who may state that not all HTML Tags have to be closed. For instance, there are plenty of pages out there that won't always use a '</LI>' Tag for elements of an Ordered or Un-Ordered List.
These types of subtle nuances hint at the commonly-heard phrase "Browser War," and the concept of validity, therefore, is not addressed in this class.
- Parameters:
html - This may be any Vectorized-HTML Web-Page (or sub-page).
The Variable-Type Wild-Card Expression '? super TagNode' means that both a Vector<TagNode> and a Vector<HTMLNode> are accepted by this parameter. Neither will cause an exception to be thrown.
Note that if a Vector<Object> is passed, and there are no instances of class TagNode contained by that Vector, then this method will simply exit gracefully.
htmlTags - This may be one, or many, HTML-Tags whose open-close count needs to be computed. Any HTML Element that is not present in this list will not have a count computed.
The count results are stored in an int[]-Array that should be considered "parallel" to this input Var-Args-Array.
- Returns:
An array of the count of each html-element present in the input vectorized-html parameter 'html'. For instance, if the following values were passed to this method:
- A Vectorized-HTML page that had 5 '<SPAN ...>' open-elements, and 6 '</SPAN>' closing SPAN-Tags.
- And at least one of the Strings in the Var-Args parameter 'htmlTags' was equal to the String "SPAN" (case insensitive).
- ==> Then the array-position corresponding to the position in array 'htmlTags' that had the "SPAN" would have a value of '-1'.
- Throws:
HTMLTokException - If any of the tags passed are not valid HTML tags.
SingletonException - If any of the String-Tags passed to parameter 'htmlTags' are 'singleton' (Self-Closing) Tags, then this exception throws.
- Code:
- Exact Method Body:
    // Check that these are all valid HTML Tags, throw an exception if not.
    htmlTags = ARGCHECK.htmlTags(htmlTags);

    // Temporary Hash-table, used to store the count of each htmlTag
    final Hashtable<String, Integer> ht = new Hashtable<>();

    // Initialize the temporary hash-table.  This will be discarded at the end of the method,
    // and converted into a parallel array.  (Parallel to the input String... htmlTags array).
    // Also, check to make sure the user hasn't requested a count of Singleton HTML Elements.

    for (final String htmlTag : htmlTags)
    {
        if (HTMLTags.isSingleton(htmlTag)) throw new SingletonException(
            "One of the tags you have passed: [" + htmlTag + "] is a singleton-tag, " +
            "and is only allowed opening versions of the tag."
        );

        ht.put(htmlTag, Integer.valueOf(0));
    }

    // Iterate through the HTML List, we are only counting HTML Elements, not text or comments
    for (final Object o : html) if (o instanceof TagNode)
    {
        final TagNode tn = (TagNode) o;

        // Get the current count from the hash-table
        final Integer I = ht.get(tn.tok);

        // The hash-table only holds elements we are counting, if null, then skip.
        if (I == null) continue;

        // Save the new, computed count, in the hash-table
        //
        // An opening-version (TC.OpeningTags, For Instance <DIV ...>) will ADD 1 to the count
        // A closing-tag (For Instance: </DIV>) will SUBTRACT 1 from the count
        //
        // NOTE: this line of code utilizes Java's Auto-Boxing & Auto-Unboxing features.

        ht.put(tn.tok, I + (tn.isClosing ? -1 : 1));
    }

    // Convert the hash-table to an integer-array, and return this to the user
    // Chat-GPT has assured me, by the way, that arrays are always initialized with zeroes.
    //
    // No need to set the elements of this array to zero...  Ok... I knew that. 😊😊

    final int[] ret = new int[htmlTags.length];

    for (int i=0; i < htmlTags.length; i++)
    {
        final Integer I = ht.get(htmlTags[i]);

        // The assignment part of this 'if' statement is Java's Auto Un-Boxing feature
        if (I != null) ret[i] = I;
    }

    return ret;
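The "parallel array" contract described for this method can be illustrated without the library. The following is a rough sketch under stated assumptions: page tags are represented as plain strings ("span" opens, "/span" closes), which is a hypothetical simplification of a Vector of TagNode, and the class name ParallelCountSketch is invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class ParallelCountSketch
{
    // Returns one count per requested tag, at the same index as in 'htmlTags'.
    // Page tags are plain strings: "span" opens, "/span" closes (an assumption of this sketch).
    static int[] check(List<String> pageTags, String... htmlTags)
    {
        final int[] ret = new int[htmlTags.length];   // int cells start at zero

        for (final String raw : pageTags)
        {
            final boolean closing = raw.startsWith("/");
            final String  tok     = closing ? raw.substring(1) : raw;

            for (int i = 0; i < htmlTags.length; i++)
                if (htmlTags[i].equalsIgnoreCase(tok)) ret[i] += closing ? -1 : 1;
        }

        return ret;
    }

    public static void main(String[] args)
    {
        // 5 opening SPAN tags and 6 closing SPAN tags, as in the description above
        final List<String> page = new ArrayList<>();
        page.addAll(Collections.nCopies(5, "span"));
        page.addAll(Collections.nCopies(6, "/span"));

        // Index 0 answers for "div", index 1 answers for "SPAN"
        System.out.println(Arrays.toString(check(page, "div", "SPAN")));   // [0, -1]
    }
}
```

Returning an array rather than a table keeps the result aligned with the order in which the caller listed the tags.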
-
checkNonZero
public static java.util.Hashtable<java.lang.String,java.lang.Integer> checkNonZero (java.util.Hashtable<java.lang.String,java.lang.Integer> ht)
Creates a Hashtable that has a count of all open and closed HTML-Tags found on the page whose count-value is not equal to zero.
This method will report when there are unbalanced HTML-Tags on a page, and strictly ignore any & all tags with a count of zero. Specifically, if a tag has a 1-to-1 open-close count, then it will not have a key available in the returned Hashtable.
Browser Validity:
There are some browser-parse advocates who may state that not all HTML Tags have to be closed. For instance, there are plenty of pages out there that won't always use a '</LI>' Tag for elements of an Ordered or Un-Ordered List.
These types of subtle nuances hint at the commonly-heard phrase "Browser War," and the concept of validity, therefore, is not addressed in this class.
Cloned Input:
This method clones the input Hashtable parameter 'ht', and removes from the clone the elements whose count was equal to zero. Because the work is done on the clone, the original table is left unmodified and remains available to the user for other operations.
- Parameters:
ht - This should be a Hashtable that was produced by a call to check(Vector), the check(...) method that returns a Hashtable.
- Returns:
A Hashtable map of the count of each html-element whose count is non-zero. For instance, if the original Vector had 5 '<A ...>' (Anchor-Link) elements, and six '</A>' then this Hashtable would have a String-key 'a' with an integer value of '-1'.
- Code:
- Exact Method Body:
    @SuppressWarnings("unchecked")
    final Hashtable<String, Integer> ret = (Hashtable<String, Integer>) ht.clone();

    final Enumeration<String> keys = ret.keys();

    while (keys.hasMoreElements())
    {
        final String key = keys.nextElement();

        // Remove any keys (HTML element-names) that have a normal ('0') count.
        if (ret.get(key).intValue() == 0) ret.remove(key);
    }

    return ret;
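The same clone-then-filter idea can also be written with the collection views of java.util.Hashtable. This is an alternative sketch, not the method body shown above: it copies the table with the Hashtable(Map) constructor and drops zero counts through values().removeIf(...). The class name NonZeroSketch is invented for the example.

```java
import java.util.Hashtable;

final class NonZeroSketch
{
    // Returns a copy of 'ht' holding only the entries whose count is non-zero.
    // The input table is never modified.
    static Hashtable<String, Integer> nonZero(Hashtable<String, Integer> ht)
    {
        final Hashtable<String, Integer> ret = new Hashtable<>(ht);

        // The values() view is backed by the table, so removing a value removes its entry
        ret.values().removeIf(count -> count == 0);

        return ret;
    }

    public static void main(String[] args)
    {
        final Hashtable<String, Integer> counts = new Hashtable<>();
        counts.put("div", 0);
        counts.put("i", 1);
        counts.put("td", -1);

        final Hashtable<String, Integer> unbalanced = nonZero(counts);

        System.out.println(unbalanced.containsKey("div"));   // false
        System.out.println(unbalanced.size());               // 2
        System.out.println(counts.size());                   // 3
    }
}
```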
-
checkTag
public static int checkTag(java.util.Vector<? super TagNode> html, java.lang.String htmlTag)
This will compute a count for just one, particular, HTML Element of whether that Element has been properly opened and closed. An open and close count (integer value) will be returned by this method.
Browser Validity:
There are some browser-parse advocates who may state that not all HTML Tags have to be closed. For instance, there are plenty of pages out there that won't always use a '</LI>' Tag for elements of an Ordered or Un-Ordered List.
These types of subtle nuances hint at the commonly-heard phrase "Browser War," and the concept of validity, therefore, is not addressed in this class.
- Parameters:
html - This may be any Vectorized-HTML Web-Page (or sub-page).
The Variable-Type Wild-Card Expression '? super TagNode' means that both a Vector<TagNode> and a Vector<HTMLNode> are accepted by this parameter. Neither will cause an exception to be thrown.
Note that if a Vector<Object> is passed, and there are no instances of class TagNode contained by that Vector, then this method will simply exit gracefully.
htmlTag - This is the html element whose open-close count needs to be kept.
- Returns:
The open-close count of the requested html-element in this Vector. For instance, if the user had requested that HTML Anchor Links be counted, and if the input Vector had 5 '<A ...>' (Anchor-Link) elements, and six '</A>' then this method would return -1.
- Throws:
HTMLTokException - If the tag passed is not a valid HTML tag.
SingletonException - If this 'htmlTag' is a 'singleton' (Self-Closing) Tag, this exception will throw.
- Code:
- Exact Method Body:
    // Check that this is a valid HTML Tag, throw an exception if invalid
    htmlTag = ARGCHECK.htmlTag(htmlTag);

    if (HTMLTags.isSingleton(htmlTag)) throw new SingletonException(
        "The tag you have passed: [" + htmlTag + "] is a singleton-tag, and is only " +
        "allowed opening versions of the tag."
    );

    // Iterate through the HTML List, we are only counting HTML Elements, not text, and
    // not HTML Comments

    TagNode tn;
    int     i = 0;

    for (final Object o : html) if (o instanceof TagNode)

        // If we encounter an HTML Element whose tag is the tag whose count we are
        // computing, then....

        if ((tn = (TagNode) o).tok.equals(htmlTag))

            // An opening-version (TC.OpeningTags, For Instance <DIV ...>) will ADD 1 to the count
            // A closing-tag (For Instance: </DIV>) will SUBTRACT 1 from the count

            i += tn.isClosing ? -1 : 1;

    return i;
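Why a singleton tag is rejected up front can be seen in a small stand-alone sketch: a tag that never has a closing version would always report a positive count, which says nothing about balance. Everything here is an assumption made for illustration: the set below lists the HTML "void" elements, which may or may not match what HTMLTags.isSingleton(...) recognizes; IllegalArgumentException stands in for SingletonException; and page tags are plain strings ("a" opens, "/a" closes).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

final class CheckTagSketch
{
    // HTML "void" elements, which never take a closing tag.  Assumption: the library's
    // own singleton list may differ from this one.
    static final Set<String> VOID_ELEMENTS = Set.of(
        "area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"
    );

    // Open-close count for a single tag; refuses tags that cannot be closed.
    static int checkTag(List<String> pageTags, String htmlTag)
    {
        final String tok = htmlTag.toLowerCase();

        if (VOID_ELEMENTS.contains(tok)) throw new IllegalArgumentException(
            "[" + tok + "] is a singleton tag; it has no closing version to balance."
        );

        int count = 0;

        for (final String raw : pageTags)
            if (raw.equalsIgnoreCase(tok))            count++;   // opening tag
            else if (raw.equalsIgnoreCase("/" + tok)) count--;   // closing tag

        return count;
    }

    public static void main(String[] args)
    {
        // 5 opening anchors and 6 closing anchors, as in the 'Returns' description above
        final List<String> page = new ArrayList<>();
        page.addAll(Collections.nCopies(5, "a"));
        page.addAll(Collections.nCopies(6, "/a"));

        System.out.println(checkTag(page, "A"));   // -1
    }
}
```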
-
depth
public static java.util.Hashtable<java.lang.String,int[]> depth (java.util.Vector<? super TagNode> html)
This method will calculate the "Maximum" and "Minimum" depth for every HTML 5.0 Tag found on a page. The Max-Depth is the largest number of Opening Tags of a particular element that are open at the same time, that is, opened before a matching closing version of the same Element is encountered. In the example below, the maximum "open-count" for the HTML 'divider' Element (<DIV>) is '2'. This is because a second <DIV> element is opened before the first is closed.
HTML Elements:
    <DIV class="MySection"><H1>These are my ideas:</H1>
    <!-- Above is an outer divider, below is an inner divider -->
    <DIV class="MyNumbers">Here are the points:
    <!-- HTML Content Here -->
    </DIV></DIV>
Browser Validity:
Generally, there are very few elements where the maximum depth should ever be greater than 1. For many standard elements such as the "Anchor Tag" (HTML '<A HREF=...>') having a maximum depth other than 1 would generally be thought of as "Invalid HTML."
What to do about such occurrences shall be left to the programmer. Of course, there are elements that commonly reach a depth greater than 1, for instance: '<SPAN STYLE=...>' tags, <table> tags, and of course any number of nested <DIV> tags.
In such an HTML page, the elements 'tr', 'td', 'table' (among others) could all have depths that reach much higher than 1.
'Count' Computation-Heuristic:
This maximum and minimum depth count will not pay any attention to whether HTML open and close tags "enclose each-other" or are "interleaved." The actual mechanics of the for-loop which calculates the count shall hopefully explain this computation clearly enough. This may be viewed in this method's hilited source-code, below.
- Parameters:
html - This may be any Vectorized-HTML Web-Page (or sub-page).
The Variable-Type Wild-Card Expression '? super TagNode' means that both a Vector<TagNode> and a Vector<HTMLNode> are accepted by this parameter. Neither will cause an exception to be thrown.
Note that if a Vector<Object> is passed, and there are no instances of class TagNode contained by that Vector, then this method will simply exit gracefully.
- Returns:
The returned Hashtable will contain an integer-array for each HTML Element that was found on the page. Each of these arrays shall be of length 3.
- Minimum Depth: return_array[0]
- Maximum Depth: return_array[1]
- Total Count: return_array[2]
REDUNDANCY NOTE: The third element of the returned array should be identical to the result produced by an invocation of method: Balance.checkTag(html, htmlTag);
- Throws:
HTMLTokException - If any of the tags passed are not valid HTML tags.
SingletonException - throws if 'htmlTag' is a 'singleton' (Self-Closing) tag
- Code:
- Exact Method Body:
    final Hashtable<String, int[]> ht = new Hashtable<>();

    // Iterate through the HTML List, we are only counting HTML Elements, not text, and not HTML Comments
    for (Object o : html) if (o instanceof TagNode)
    {
        final TagNode tn = (TagNode) o;

        // Don't keep a count on singleton tags.
        if (HTMLTags.isSingleton(tn.tok)) continue;

        // If this is the first encounter of a particular HTML Element, create a MAX/MIN
        // integer array, and initialize its values to zero.

        int[] curMaxAndMinArr = ht.get(tn.tok);

        if (curMaxAndMinArr == null)

            // Current Min Depth Count for Element "tn.tok" is zero
            // Current Max Depth Count for Element "tn.tok" is zero
            // Current Computed Depth Count for "tn.tok" is zero

            ht.put(tn.tok, curMaxAndMinArr = new int[3]);

        // curCount += tn.isClosing ? -1 : 1;
        //
        // An opening-version (TC.OpeningTags, For Instance <DIV ...>) will ADD 1 to the count
        // A closing-tag (For Instance: </DIV>) will SUBTRACT 1 from the count

        curMaxAndMinArr[2] += tn.isClosing ? -1 : 1;

        // If the current depth-count is a "New Minimum" (a new low! :), then save it in the
        // minimum pos of the output-array.

        if (curMaxAndMinArr[2] < curMaxAndMinArr[0]) curMaxAndMinArr[0] = curMaxAndMinArr[2];

        // If the current depth-count (for this tag) is a "New Maximum" (a new high), save it
        // to the max-pos of the output-array.

        if (curMaxAndMinArr[2] > curMaxAndMinArr[1]) curMaxAndMinArr[1] = curMaxAndMinArr[2];
    }

    return ht;
-
depth
public static java.util.Hashtable<java.lang.String,int[]> depth (java.util.Vector<? super TagNode> html, java.lang.String... htmlTags)
This method will calculate the "Maximum" and "Minimum" depth for every HTML Tag listed in the var-args String[] htmlTags parameter. The Max-Depth is the largest number of Opening Tags of a particular element that are open at the same time, that is, opened before a matching closing version of the same Element is encountered. In the example below, the maximum 'open-count' for the HTML 'divider' Element (<DIV>) is '2'. This is because a second <DIV> element is opened before the first is closed.
HTML Elements:
    <DIV class="MySection"><H1>These are my ideas:</H1>
    <!-- Above is an outer divider, below is an inner divider -->
    <DIV class="MyNumbers">Here are the points:
    <!-- HTML Content Here -->
    </DIV></DIV>
Browser Validity:
Generally, there are very few elements where the maximum depth should ever be greater than 1. For many standard elements such as the "Anchor Tag" (HTML '<A HREF=...>') having a maximum depth other than 1 would generally be thought of as "Invalid HTML."
What to do about such occurrences shall be left to the programmer. Of course, there are elements that commonly reach a depth greater than 1, for instance: '<SPAN STYLE=...>' tags, <table> tags, and of course any number of nested <DIV> tags.
In such an HTML page, the elements 'tr', 'td', 'table' (among others) could all have depths that reach much higher than 1.
'Count' Computation-Heuristic:
This maximum and minimum depth count will not pay any attention to whether HTML open and close tags "enclose each-other" or are "interleaved." The actual mechanics of the for-loop which calculates the count shall hopefully explain this computation clearly enough. This may be viewed in this method's hilited source-code, below.
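What "interleaved" means for the heuristic can be shown with a small example. In the sketch below the sequence <B><I></B></I> produces a count of zero for both tags, so a per-tag counter reports nothing unusual. For contrast, a stack-based nesting check is included; that check is NOT something this page says the Balance class performs, it is only here to make the difference visible. Page tags are plain strings ("b" opens, "/b" closes), a hypothetical simplification, and the class name is invented.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

final class InterleaveSketch
{
    // Per-tag balance: +1 for each opening 'tok', -1 for each closing "/tok".
    static int balance(List<String> tags, String tok)
    {
        int count = 0;

        for (final String raw : tags)
            if (raw.equals(tok))            count++;
            else if (raw.equals("/" + tok)) count--;

        return count;
    }

    // Stack-based check (an addition for comparison only): every closing tag must
    // match the most recently opened tag that is still open.
    static boolean properlyNested(List<String> tags)
    {
        final Deque<String> open = new ArrayDeque<>();

        for (final String raw : tags)
            if (raw.startsWith("/"))
            {
                if (open.isEmpty() || ! open.pop().equals(raw.substring(1))) return false;
            }
            else open.push(raw);

        return open.isEmpty();
    }

    public static void main(String[] args)
    {
        final List<String> interleaved = List.of("b", "i", "/b", "/i");

        System.out.println(balance(interleaved, "b"));      // 0
        System.out.println(balance(interleaved, "i"));      // 0
        System.out.println(properlyNested(interleaved));    // false
    }
}
```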
Var-Args Addition:
This method differs from the method with an identical name (defined above) in that it adds a String-VarArgs parameter that allows a user to decide which tags should be counted and returned in this Hashtable, and which should be ignored.
If one of the requested HTML-Tags from this String-VarArgs parameter is not actually an HTML Element present on the page, the returned Hashtable will still contain an int[]-Array for that tag. The values in that array will be equal to zero.
- Parameters:
html - This may be any Vectorized-HTML Web-Page (or sub-page).
The Variable-Type Wild-Card Expression '? super TagNode' means that both a Vector<TagNode> and a Vector<HTMLNode> are accepted by this parameter. Neither will cause an exception to be thrown.
Note that if a Vector<Object> is passed, and there are no instances of class TagNode contained by that Vector, then this method will simply exit gracefully.
- Returns:
The returned Hashtable will contain an integer-array for each HTML Element that was requested. Each of these arrays shall be of length 3.
- Minimum Depth: return_array[0]
- Maximum Depth: return_array[1]
- Total Count: return_array[2]
REDUNDANCY NOTE: The third element of the returned array should be identical to the result produced by an invocation of method: Balance.checkTag(html, htmlTag);
- Throws:
HTMLTokException - If any of the tags passed are not valid HTML tags.
SingletonException - If any 'htmlTag' is a 'singleton' (Self-Closing) Tag, this exception will throw.
- Code:
- Exact Method Body:
    // Check that these are all valid HTML Tags, throw an exception if not.
    htmlTags = ARGCHECK.htmlTags(htmlTags);

    final Hashtable<String, int[]> ht = new Hashtable<>();

    // Initialize the temporary hash-table.  This will be discarded at the end of the method,
    // and converted into a parallel array.  (Parallel to the input String... htmlTags array).
    // Also, check to make sure the user hasn't requested a count of Singleton HTML Elements.

    for (final String htmlTag : htmlTags)

        if (HTMLTags.isSingleton(htmlTag)) throw new SingletonException(
            "One of the tags you have passed: [" + htmlTag + "] is a singleton-tag, " +
            "and is only allowed opening versions of the tag."
        );

        // Insert a new array for this HTML-Tag, Java Auto Initializes array cells to zero
        else ht.put(htmlTag, new int[3]);

    // Iterate through the HTML List, we are only counting HTML Elements, not text nor comments
    for (final Object o: html) if (o instanceof TagNode)
    {
        final TagNode tn = (TagNode) o;

        final int[] curMaxAndMinArr = ht.get(tn.tok);

        // If this is null, we are attempting to perform the count on an HTML Element that
        // wasn't requested by the user with the var-args 'String... htmlTags' parameter.
        // The Hashtable was initialized to only have those tags. (see about 5 lines above
        // where the Hashtable is initialized)

        if (curMaxAndMinArr == null) continue;

        // An opening-version (TC.OpeningTags, For Instance <DIV ...>) will ADD 1 to the count
        // A closing-tag (For Instance: </DIV>) will SUBTRACT 1 from the count

        curMaxAndMinArr[2] += tn.isClosing ? -1 : 1;

        // If the current depth-count is a "New Minimum" (a new low! :), then save it in the
        // minimum pos of the output-array.

        if (curMaxAndMinArr[2] < curMaxAndMinArr[0]) curMaxAndMinArr[0] = curMaxAndMinArr[2];

        // If the current depth-count (for this tag) is a "New Maximum" (a new high), save it
        // to the max-pos of the output-array.

        if (curMaxAndMinArr[2] > curMaxAndMinArr[1]) curMaxAndMinArr[1] = curMaxAndMinArr[2];

        // No need to update the hash-table, since this is an array - changing its
        // values is already "reflected" into the Hashtable.
    }

    return ht;
-
depthInvalid
public static java.util.Hashtable<java.lang.String,int[]> depthInvalid (java.util.Hashtable<java.lang.String,int[]> ht)
Creates a Hashtable that has a maximum and minimum depth for all HTML tags found on the page. Any HTML Tags that meet ALL of these criteria shall be removed from the result-set Hashtable...
- Minimum Depth Is '0' - i.e. closing tag never precedes opening.
- Count is '0' - i.e. there is a 1-to-1 ratio of opening and closing tags for the particular HTML Element.
This means that there is a 1:1 ratio of opening and closing versions of the tag, and also that there are no positions in the vector where a closing tag comes before a tag that opens it.
Cloned Input:
This method clones the original input Hashtable, and removes from the clone the tags whose depth-calculations are valid, as described above. Because the work is done on the clone, the original table is left unmodified and remains available to the user for other operations.
- Parameters:
ht - This should be a Hashtable that was produced by a call to one of the two available depth(...) methods.
- Returns:
This shall return a Hashtable of HTML Tags that are potentially (but not guaranteed to be) invalid.
- Code:
- Exact Method Body:
    @SuppressWarnings("unchecked")
    final Hashtable<String, int[]> ret = (Hashtable<String, int[]>) ht.clone();

    final Enumeration<String> keys = ret.keys();

    // Using the "Enumeration" class allows the situation where elements can be removed from
    // the underlying data-structure - while iterating through that data-structure.  This is
    // not possible using a keySet Iterator.

    while (keys.hasMoreElements())
    {
        final String key = keys.nextElement();
        final int[] arr = ret.get(key);

        if ((arr[1] >= 0) && (arr[2] == 0)) ret.remove(key);
    }

    return ret;
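Note that the body above tests arr[1], which the depth(...) methods document as the Maximum Depth, while the description of this method speaks of the Minimum Depth (arr[0]). The sketch below follows the written description rather than the listed code; whether that is the author's intent is an assumption. It uses only java.util, and the class name is invented for the example.

```java
import java.util.Hashtable;

final class DepthInvalidSketch
{
    // Keeps only the tags that fail the two criteria in the description:
    // arr[0] (minimum depth) is 0, and arr[2] (final count) is 0.
    // The copy is shallow, so the int[] arrays are shared with the input table.
    static Hashtable<String, int[]> invalid(Hashtable<String, int[]> ht)
    {
        final Hashtable<String, int[]> ret = new Hashtable<>(ht);

        ret.values().removeIf(arr -> (arr[0] == 0) && (arr[2] == 0));

        return ret;
    }

    public static void main(String[] args)
    {
        final Hashtable<String, int[]> depths = new Hashtable<>();
        depths.put("div",  new int[] {  0, 2, 0 });   // nested, but fully balanced
        depths.put("span", new int[] { -1, 0, 0 });   // a closing tag came first
        depths.put("b",    new int[] {  0, 1, 1 });   // opened, never closed

        final Hashtable<String, int[]> suspect = invalid(depths);

        System.out.println(suspect.containsKey("div"));    // false
        System.out.println(suspect.containsKey("span"));   // true
        System.out.println(suspect.containsKey("b"));      // true
    }
}
```

With the arr[1] test from the listed body, the "span" entry in this sample would be removed, because its maximum depth is 0 and its count is 0.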
-
depthGreaterThanOne
public static java.util.Hashtable<java.lang.String,int[]> depthGreaterThanOne (java.util.Hashtable<java.lang.String,int[]> ht)
Creates a Hashtable that has a maximum and minimum depth for all HTML tags found on the page. Any HTML Tags that meet ALL of these criteria, below, shall be removed from the result-set Hashtable...
- Maximum Depth is precisely '1' - i.e. each element of this tag is closed before a second is opened.
Cloned Input:
This method clones the original input Hashtable, and removes from the clone the tags whose maximum-depth is exactly one. Because the work is done on the clone, the original table is left unmodified and remains available to the user for other operations.
- Parameters:
ht - This should be a Hashtable that was produced by a call to one of the two available depth(...) methods.
- Returns:
This shall return a Hashtable of HTML Tags that are potentially (but not guaranteed to be) invalid.
- Code:
- Exact Method Body:
    @SuppressWarnings("unchecked")
    final Hashtable<String, int[]> ret = (Hashtable<String, int[]>) ht.clone();

    final Enumeration<String> keys = ret.keys();

    // Using the "Enumeration" class allows the situation where elements can be removed from
    // the underlying data-structure - while iterating through that data-structure.  This is not
    // possible using a keySet Iterator.

    while (keys.hasMoreElements())
    {
        final String key = keys.nextElement();
        final int[] arr = ret.get(key);

        if (arr[1] == 1) ret.remove(key);
    }

    return ret;
-
depthTag
public static int[] depthTag(java.util.Vector<? super TagNode> html, java.lang.String htmlTag)
This method will calculate the "Maximum" and "Minimum" depth for a particular HTML Tag. The Max-Depth is the largest number of Opening Tags of that element that are open at the same time, that is, opened before a matching closing version of the same Element is encountered.
For instance:
    <DIV ...><DIV ..> Some Page</DIV></DIV>
has a maximum depth of '2'. This means there is a point in the vectorized-html where there are 2 successive divider elements that are opened, before even one has been closed.
Browser Validity:
Generally, there are very few elements where the maximum depth should ever be greater than 1. For many standard elements such as the "Anchor Tag" (HTML '<A HREF=...>') having a maximum depth other than 1 would generally be thought of as "Invalid HTML."
What to do about such occurrences shall be left to the programmer. Of course, there are elements that commonly reach a depth greater than 1, for instance: '<SPAN STYLE=...>' tags, <table> tags, and of course any number of nested <DIV> tags.
On pages that nest tables, the elements 'tr', 'td', 'table' (among others) could all have depths that reach much higher than 1.
'Count' Computation-Heuristic:
This maximum and minimum depth count does not pay any attention to whether HTML open and close tags "enclose each other" or are "interleaved." The mechanics of the for-loop which calculates the count should explain this computation clearly enough; it may be viewed in this method's highlighted source code, below.
- Parameters:
html - This may be any Vectorized-HTML Web-Page (or sub-page).
The Variable-Type Wild-Card Expression '? super TagNode' means that a Vector<TagNode> or a Vector<HTMLNode> are both accepted by this parameter. They will not cause an exception throw.
Note that if a Vector<Object> is passed, and there are no instances of class TagNode contained by that Vector, then this method will simply exit gracefully.
htmlTag - The html element whose maximum and minimum depth-count needs to be computed.
- Returns:
- The returned integer array shall be of length 3.
- Minimum Depth: return_array[0]
- Maximum Depth: return_array[1]
- Total Count: return_array[2]
REDUNDANCY NOTE: The third element of the returned array should be identical to the result produced by an invocation of method: Balance.checkTag(html, htmlTag);
- Throws:
HTMLTokException - throws if any of the tags passed are not valid HTML tags
SingletonException - throws if 'htmlTag' is a 'singleton' (Self-Closing) tag
- Code:
- Exact Method Body:
// Check that this is a valid HTML Tag, throw an exception if invalid
htmlTag = ARGCHECK.htmlTag(htmlTag);

if (HTMLTags.isSingleton(htmlTag)) throw new SingletonException(
    "The tag you have passed: [" + htmlTag + "] is a singleton-tag, and is only allowed " +
    "opening versions of the tag."
);

int i = 0, max = 0, min = 0;

// Iterate through the HTML List, we are only counting HTML Elements, not text or Comments
for (final Object o : html) if (o instanceof TagNode)
{
    final TagNode tn = (TagNode) o;

    if (! tn.tok.equals(htmlTag)) continue;

    // An opening "<TABLE ...>" ADDS 1 to the count. A closing-"</TABLE>" SUBTRACTS 1
    i += tn.isClosing ? -1 : 1;

    if (i > max) max = i;
    if (i < min) min = i;
}

return new int[] { min, max, i };
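To see what the three returned values look like on a small input, here is a self-contained sketch of the same running-count loop. It does not use the library: the nested class `Tag` is a hypothetical stand-in for TagNode that carries only the two fields read by the loop (`tok` and `isClosing`), and the class name `DepthSketch` is invented for illustration.

```java
import java.util.Arrays;
import java.util.List;

final class DepthSketch
{
    // Hypothetical stand-in for TagNode: only the two fields that the loop reads
    static final class Tag
    {
        final String tok;
        final boolean isClosing;

        Tag(String tok, boolean isClosing)
        { this.tok = tok; this.isClosing = isClosing; }
    }

    // Returns { min, max, total } for one tag, using a single running count
    static int[] depth(List<?> html, String htmlTag)
    {
        int i = 0, max = 0, min = 0;

        for (final Object o : html)
        {
            // Text and other non-tag entries are skipped
            if (! (o instanceof Tag)) continue;

            final Tag t = (Tag) o;
            if (! t.tok.equals(htmlTag)) continue;

            // An opening tag adds 1 to the count, a closing tag subtracts 1
            i += t.isClosing ? -1 : 1;

            if (i > max) max = i;
            if (i < min) min = i;
        }

        return new int[] { min, max, i };
    }

    public static void main(String[] args)
    {
        // <div><div>text</div></div></div> : one closing tag too many
        final List<Object> page = Arrays.asList(
            new Tag("div", false), new Tag("div", false), "text",
            new Tag("div", true), new Tag("div", true), new Tag("div", true)
        );

        System.out.println(Arrays.toString(depth(page, "div")));  // prints [-1, 2, -1]
    }
}
```

The negative minimum is what signals the stray closing tag; a page whose tags are all balanced and never nested would produce { 0, 1, 0 }.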
-
nonNestedCheck
public static int[] nonNestedCheck(java.util.Vector<? super TagNode> html, java.lang.String htmlTag)
This will find the (likely) places where the "non-nested HTML Elements" have become nested. For the purposes of finding mismatched elements - such as an unclosed "Italics" Element, or an "Extra" Italics Element - this method will find places where a new HTML Tag has opened before a previous one has been closed - or vice-versa (where there is an 'extra' closed-tag).
Certainly, if "nesting" is usually acceptable (for instance the HTML divider '<DIV>...</DIV>' construct) then the results of this method would not have any meaning. Fortunately, for the vast majority of HTML Elements (<I>, <B>, <A>, etc.) nesting the tags is not allowed or encouraged.
The following example use of this method should make the application clear. If, for example, a user has identified that there is an unclosed HTML italics element (<I>...</I>) somewhere on a page, and that page has numerous italics elements, this method can pinpoint the failure instantly, as in this example. Note that the file-name is a Java-Doc generated output HTML file. The documentation for this package received a copious amount of attention due to the sheer number of method-names and class-names used throughout.
Example:
String fStr = FileRW.loadFileToString("javadoc/Torello/HTML/TagNode.html");
Vector<HTMLNode> v = HTMLPage.getPageTokens(fStr, false);
int[] posArr = Balance.nonNestedCheck(v, "i");

// Below, the class 'Debug' is used to pretty-print the vectorized-html page. Here, the
// output will find the lone, non-closed, HTML italics <I> ... </I> tag-element, and output
// it to the terminal-window. The parameter '5' means the nearest 5 elements (in either
// direction) are printed, in addition to the elements at the indices in the posArr.
// Parameter 'true' implies that two curly braces are printed surrounding the matched node.
System.out.println(Debug.print(v, posArr, 5, " Skip a few ", true, Debug::K));
- Parameters:
html - This may be any Vectorized-HTML Web-Page (or sub-page).
The Variable-Type Wild-Card Expression '? super TagNode' means that a Vector<TagNode> or a Vector<HTMLNode> are both accepted by this parameter. They will not cause an exception throw.
Note that if a Vector<Object> is passed, and there are no instances of class TagNode contained by that Vector, then this method will simply exit gracefully.
htmlTag - This is the html element whose maximum and minimum depth-count was not 1 and 0, respectively. The precise location where the depth became negative, or became greater than 1, will be returned in the integer array. In English: when two opening-tags or two closing-tags are identified, successively, the index where the second tag was found is recorded into the output array.
- Returns:
- This will return an array of vectorized-html index-locations / index-pointers where a second successive opening tag, or a second successive closing tag, occurs. This will facilitate finding tags that are not intended to be nested. If "tag-nesting" is expected (for example HTML divider, 'DIV', elements), then the results returned by this method will not be useful.
- Throws:
HTMLTokException - If any of the tags passed are not valid HTML tags.
SingletonException - If this 'htmlTag' is a 'singleton' (Self-Closing) Tag, this exception will throw.
- See Also:
FileRW.loadFileToString(String), HTMLPage.getPageTokens(CharSequence, boolean), Debug.print(Vector, int[], int, String, boolean, BiConsumer)
- Code:
- Exact Method Body:
// Check that this is a valid HTML Tag, throw an exception if invalid
htmlTag = ARGCHECK.htmlTag(htmlTag);

if (HTMLTags.isSingleton(htmlTag)) throw new SingletonException(
    "The tag you have passed: [" + htmlTag + "] is a singleton-tag, and is only " +
    "allowed opening versions of the tag."
);

// Java Streams are an easier way to keep variable-length lists. They use "builders".
// This one is for an "IntStream"
final IntStream.Builder b = IntStream.builder();

// Iterate through HTML List, we are only counting HTML Elements, not text nor comments
final int LEN = html.size();
TC last = null;

for (int i=0; i < LEN; i++) if (html.elementAt(i) instanceof TagNode)
{
    final TagNode tn = (TagNode) html.elementAt(i);

    if (! tn.tok.equals(htmlTag)) continue;

    if ((tn.isClosing) && (last == TC.ClosingTags)) b.add(i);
    if ((! tn.isClosing) && (last == TC.OpeningTags)) b.add(i);

    last = tn.isClosing ? TC.ClosingTags : TC.OpeningTags;
}

return b.build().toArray();
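The "two successive tags of the same kind" rule can be tried on plain data with the sketch below. It is an illustration only and does not call the library: `Tag` is a hypothetical stand-in for TagNode, a nullable `Boolean` replaces the `TC` enum used in the method body, and the names `NonNestedSketch` and `repeats` are invented.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.IntStream;

final class NonNestedSketch
{
    // Hypothetical stand-in for TagNode: only the two fields that the loop reads
    static final class Tag
    {
        final String tok;
        final boolean isClosing;

        Tag(String tok, boolean isClosing)
        { this.tok = tok; this.isClosing = isClosing; }
    }

    // Records each index where a tag has the same open/close kind as the previous
    // tag of the same element (two openings in a row, or two closings in a row)
    static int[] repeats(List<?> html, String htmlTag)
    {
        final IntStream.Builder b = IntStream.builder();

        // null until the first matching tag has been seen
        Boolean lastWasClosing = null;

        for (int i = 0; i < html.size(); i++)
        {
            final Object o = html.get(i);
            if (! (o instanceof Tag)) continue;

            final Tag t = (Tag) o;
            if (! t.tok.equals(htmlTag)) continue;

            if ((lastWasClosing != null) && (lastWasClosing == t.isClosing)) b.add(i);

            lastWasClosing = t.isClosing;
        }

        return b.build().toArray();
    }

    public static void main(String[] args)
    {
        // <i>a</i><i>b<i>c</i> : the <i> at index 3 is never closed
        final List<Object> page = Arrays.asList(
            new Tag("i", false), "a", new Tag("i", true),
            new Tag("i", false), "b", new Tag("i", false), "c", new Tag("i", true)
        );

        System.out.println(Arrays.toString(repeats(page, "i")));  // prints [5]
    }
}
```

Note that the reported index (5) is where the second opening tag appears, not the position of the tag that was left unclosed; the unclosed tag is the nearest earlier opening tag.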
-
locationsAndDepth
public static Ret2<int[],int[]> locationsAndDepth (java.util.Vector<? super TagNode> html, java.lang.String htmlTag)
For likely more than 95% of HTML tags, finding situations where that tag has 'nested tags' is highly unlikely. Unfortunately, for two or three of the most common tags in use, for instance <DIV> and <SPAN>, finding where a mismatch has occurred (tracking down an "Unclosed divider") is an order of magnitude more difficult than finding an unclosed anchor '<A HREF...>'. This method shall return two parallel arrays. The first array will contain vector indices. The second array contains the depth (nesting level) of that tag at that position. In this way, finding an unclosed divider is tantamount to finding where all closing-dividers seem to evaluate to a depth of '1' (one) rather than '0' (zero).
This method can be highly useful for <SPAN> and <DIV>, while the "non-standard depth locations" method can be extremely useful for simple, non-nested tags such as Anchor, Paragraph, Section, etc., which are HTML Elements that are almost never nested.
Example:
// Load an HTML File to a String
String file = LFEC.loadFile("~/HTML/MyHTMLFile.html");

// Parse, and convert to vectorized-html
Vector<HTMLNode> v = HTMLPage.getPageTokens(file, false);

// Run this method
Ret2<int[], int[]> r = Balance.locationsAndDepth(v, "div");

// This array has vector-indices
int[] posArr = (int[]) r.a;

// This (parallel) array has the depth at that index.
int[] depthArr = (int[]) r.b;

for (int i=0; i < posArr.length; i++) System.out.println(
    "(" + posArr[i] + ", " + depthArr[i] + "):\t" +    // Prints the Vector-Index, and Depth
    C.BRED + v.elementAt(posArr[i]).str + C.RESET      // Prints the actual HTML divider.
);
The above code would produce a list of HTML Divider elements, along with their index in the Vector, and the exact depth (number of nested, open 'DIV' elements) at that location. This is usually helpful when trying to find unclosed HTML Tags.
- Parameters:
html - This may be any Vectorized-HTML Web-Page (or sub-page).
The Variable-Type Wild-Card Expression '? super TagNode' means that a Vector<TagNode> or a Vector<HTMLNode> are both accepted by this parameter. They will not cause an exception throw.
Note that if a Vector<Object> is passed, and there are no instances of class TagNode contained by that Vector, then this method will simply exit gracefully.
htmlTag - This is the html element that has an imbalanced OPEN-CLOSE ratio in the tree.
- Returns:
- Two parallel arrays, as follows:
- Ret2.a (int[]): This shall be an integer array of Vector-indices where the HTML Element has been found.
- Ret2.b (int[]): This shall contain an array of the value of the depth for the 'htmlTag' at the particular Vector-index identified in the first array.
- Throws:
HTMLTokException - throws if any of the tags passed are not valid HTML tags.
SingletonException - throws if 'htmlTag' is a 'singleton' (Self-Closing) Tag
- Code:
- Exact Method Body:
// Check that this is a valid HTML Tag, throw an exception if invalid
htmlTag = ARGCHECK.htmlTag(htmlTag);

if (HTMLTags.isSingleton(htmlTag)) throw new SingletonException(
    "The tag you have passed: [" + htmlTag + "] is a singleton-tag, and is only " +
    "allowed opening versions of the tag."
);

// Java Streams are an easier way to keep variable-length lists. They use "builders".
// These builders are for an "IntStream"
final IntStream.Builder locations = IntStream.builder();
final IntStream.Builder depthAtLocation = IntStream.builder();

// Iterate through the HTML List, we are only counting HTML Elements, not text or comments
final int LEN = html.size();
int depth = 0;

for (int i=0; i < LEN; i++) if (html.elementAt(i) instanceof TagNode)
{
    final TagNode tn = (TagNode) html.elementAt(i);

    if (! tn.tok.equals(htmlTag)) continue;

    depth += tn.isClosing ? -1 : 1;

    locations.add(i);
    depthAtLocation.add(depth);
}

return new Ret2<int[], int[]>
    (locations.build().toArray(), depthAtLocation.build().toArray());
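The shape of the two parallel arrays is easiest to see on a tiny input. The sketch below mirrors the loop above on plain data and is not the library's code: `Tag` is a hypothetical stand-in for TagNode, an `int[][]` pair replaces `Ret2`, and the names `LocationsSketch` and `locationsAndDepth` (as a local method) are used only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.IntStream;

final class LocationsSketch
{
    // Hypothetical stand-in for TagNode: only the two fields that the loop reads
    static final class Tag
    {
        final String tok;
        final boolean isClosing;

        Tag(String tok, boolean isClosing)
        { this.tok = tok; this.isClosing = isClosing; }
    }

    // Returns { locations, depths }: two arrays of equal length, where depths[k] is
    // the running depth immediately after the tag found at index locations[k]
    static int[][] locationsAndDepth(List<?> html, String htmlTag)
    {
        final IntStream.Builder locations = IntStream.builder();
        final IntStream.Builder depths = IntStream.builder();

        int depth = 0;

        for (int i = 0; i < html.size(); i++)
        {
            final Object o = html.get(i);
            if (! (o instanceof Tag)) continue;

            final Tag t = (Tag) o;
            if (! t.tok.equals(htmlTag)) continue;

            depth += t.isClosing ? -1 : 1;

            locations.add(i);
            depths.add(depth);
        }

        return new int[][] { locations.build().toArray(), depths.build().toArray() };
    }

    public static void main(String[] args)
    {
        // <div><div>text</div><div></div> : the outermost <div> is never closed
        final List<Object> page = Arrays.asList(
            new Tag("div", false), new Tag("div", false), "text",
            new Tag("div", true), new Tag("div", false), new Tag("div", true)
        );

        final int[][] r = locationsAndDepth(page, "div");

        System.out.println(Arrays.toString(r[0]));  // prints [0, 1, 3, 4, 5]
        System.out.println(Arrays.toString(r[1]));  // prints [1, 2, 1, 2, 1]
    }
}
```

In this output every closing tag leaves the depth at 1 and the count never returns to 0, which is the pattern described above for an unclosed divider.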
-
toStringDepth
public static java.lang.String toStringDepth (java.util.Hashtable<java.lang.String,int[]> depthReport)
Converts a depth report to a String, for printing.
- Parameters:
depthReport - This should be a Hashtable returned by any of the depth-methods.
- Returns:
- This shall return the report as a String.
- Code:
- Exact Method Body:
final StringBuilder sb = new StringBuilder();

for (final String htmlTag : depthReport.keySet())
{
    final int[] arr = depthReport.get(htmlTag);

    sb.append(
        "HTML Element: [" + htmlTag + "]:\t" +
        "Min-Depth: " + arr[0] + ",\tMax-Depth: " + arr[1] + ",\tCount: " + arr[2] + "\n"
    );
}

return sb.toString();
-
toStringBalance
public static java.lang.String toStringBalance (java.util.Hashtable<java.lang.String,java.lang.Integer> balanceCheckReport)
Converts a balance report to a String, for printing.
- Parameters:
balanceCheckReport - This should be a Hashtable returned by any of the balance-check methods.
- Returns:
- This shall return the report as a String.
- Code:
- Exact Method Body:
final StringBuilder sb = new StringBuilder();

int maxTagLen = 0, maxValStrLen = 0, maxAbsValStrLen = 0;

// For good spacing purposes, we need the length of the longest of the tags.
for (final String htmlTag : balanceCheckReport.keySet())
    if (htmlTag.length() > maxTagLen)
        maxTagLen = htmlTag.length();

// 17 is the length of the string below, 2 is the amount of extra-space needed
maxTagLen += 17 + 2;

for (final int v : balanceCheckReport.values())
{
    final String vStr = "" + v;
    if (vStr.length() > maxValStrLen) maxValStrLen = vStr.length();

    final String absStr = "" + Math.abs(v);
    if (absStr.length() > maxAbsValStrLen) maxAbsValStrLen = absStr.length();
}

int val = 0;

for (final String htmlTag : balanceCheckReport.keySet())

    sb.append(
        StringParse.rightSpacePad("HTML Element: [" + htmlTag + "]:", maxTagLen) +
        StringParse.rightSpacePad(
            ("" + (val = balanceCheckReport.get(htmlTag).intValue())),
            maxValStrLen
        ) +
        NOTE(val, htmlTag, maxAbsValStrLen) + "\n"
    );

return sb.toString();
-
toStringBalance
public static java.lang.String toStringBalance (int[] balanceCheckReport, java.lang.String... htmlTags)
Converts a balance report to a String, for printing.
- Parameters:
balanceCheckReport - This should be an integer-array balance report returned by one of the balance-check methods.
htmlTags - The HTML tags that were used to create the report. This list must be the same length as 'balanceCheckReport'.
- Returns:
- This shall return the report as a String.
- Throws:
java.lang.IllegalArgumentException - This exception throws if the lengths of the two input arrays are not equal. It is imperative that the balance report being printed was created by the html-tags that are listed in the HTML Token var-args parameter. If the two arrays are the same length, but the tags used to create the report are not the same ones being passed to the var-args parameter 'htmlTags', the logic will not know the difference, and no exception is thrown.
- Code:
- Exact Method Body:
if (balanceCheckReport.length != htmlTags.length) throw new IllegalArgumentException(
    "The balance report that you are checking was not generated using the html token " +
    "list provided, they are different lengths. balanceCheckReport.length: " +
    "[" + balanceCheckReport.length + "]\t htmlTags.length: [" + htmlTags.length + "]"
);

final StringBuilder sb = new StringBuilder();

for (int i=0; i < balanceCheckReport.length; i++)
    sb.append("HTML Element: [" + htmlTags[i] + "]:\t" + balanceCheckReport[i] + "\n");

return sb.toString();
-
-