Package Torello.HTML

At the core of the HTML Java Utility Package are three HTMLNode classes: the abstract parent HTMLNode and its subclasses TagNode and TextNode. These classes are extremely light-weight, since each has at most three fields. The code is kept open (visible) and well-documented using the JavaDoc Code Documentation Upgrade Tool. The primary purpose of this entire HTML downloading, scraping, and searching package is to provide a way of converting the, sort-of if-you-will, "raw HTML" into more usable Java Vector<HTMLNode> objects. These search and scrape routines do not concern themselves too much with validating HTML, although checking for HTML validity using whatever means you wish should be extremely easy. Going "Beyond the Browser Wars" usually means that the vast majority of public web-sites largely contain valid HTML generated by HTML Generation Tools. The real goal is to reuse, copy, modify, or even extract data from these websites (particularly foreign-news websites).
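
To make that conversion concrete, here is a minimal sketch of the parse step. It assumes getPageTokens also accepts an in-memory HTML string (the URL-based variant is shown at the bottom of this page), and the node breakdown in the comment is illustrative rather than exact:

 Vector<HTMLNode> nodes = HTMLPage.getPageTokens("<p>Hello, <b>World</b></p>", false);

 // Expect alternating nodes, roughly: TagNode("<p>"), TextNode("Hello, "),
 // TagNode("<b>"), TextNode("World"), TagNode("</b>"), TagNode("</p>")
 for (HTMLNode n : nodes)
      System.out.println(n.getClass().getSimpleName() + ":\t" + n.str);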

The package Torello.HTML.NodeSearch is, at its core, a small collection of nearly-identical Java for-loops that allow a person to "stop worrying" about the end-point loop checks which sit at the heart of so many programming projects. The search-loops are all available to read in the NodeSearch package. Re-typing such loops by hand is *extremely* error-prone, which is the real benefit of this JAR Library.
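
As an illustration, the hand-written loop below finds every opening <A> tag; it is exactly the kind of loop NodeSearch encapsulates. This is only a sketch: the TagNode fields 'tok' (the tag name) and 'isClosing' used here are assumptions about the class layout, and 'webPage' is a parsed Vector<HTMLNode>, as produced in the download example at the bottom of this page.

 // Hand-rolled search loop (field names 'tok' and 'isClosing' are assumed):
 Vector<TagNode> anchors = new Vector<>();
 for (HTMLNode n : webPage)
      if (n instanceof TagNode)
      {
          TagNode tn = (TagNode) n;
          if (tn.tok.equals("a") && ! tn.isClosing) anchors.add(tn);
      }

 // The equivalent single call using the NodeSearch package:
 Vector<TagNode> anchors2 = TagNodeGet.all(webPage, TC.OpeningTags, "a");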

HTML is dealt with using three primary classes that inherit from public abstract class HTMLNode. They are public class TagNode, and also TextNode and CommentNode. There are a few 'extra' classes that may seem to slightly complicate things: 'TagNodeIndex', 'TextNodeIndex', 'CommentNodeIndex' and also 'SubSection'. Although they have "complicated sounding names," what they actually help achieve is a provision for returning a node plus its index in a "single return data class." This (occasionally) makes some search operations easier, where "multiple return values" would be difficult.
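
The idea behind these node-plus-index classes is simple enough to sketch in a few lines. The class below is purely illustrative, not the library's actual definition:

 // Conceptual sketch of a "single return data class" (illustrative only):
 // a search routine can return both the node it found and where it found it.
 class NodePlusIndex
 {
     final int      index;   // position inside the Vector<HTMLNode>
     final HTMLNode n;       // the node located at that position

     NodePlusIndex(int index, HTMLNode n)
     { this.index = index; this.n = n; }
 }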

The primary impetus for writing these packages was to scrape HTML from web-sites whose content comes in from "overseas." In this "Internet Age," nothing could "feel more useless" - or, literally, "be more useless" - than reading the tripe and drivel of a local newspaper telling us about the heroes at the fire-department or the police-department jumping into burning buildings to give us that security we love and cherish so much. What a bunch of bunk! It causes such corruption in the (former) United States, and makes the lives of the people who really still do live in the (former) United States so much worse. Rather than beginning a long-winded diatribe about how awful the American Government has been, it can be extremely enjoyable, even a lot of fun, to try to learn, read, and translate stories from all around the world. Why did we even invent an internet in the first place? To sit around and read stories about the Dunkin' Donuts down the street? Please!

There are as many error checks built into this code as can be provided, and these error-checks are reported as exceptions. Read more of the Java Documentation to find out about these exceptions. And finally... the public class 'PageStats', if you look closely, should provide as much example information as possible about what the search subroutines in this scrape package actually accomplish. There is also a "Work Book" class called public class 'Elements' in the package 'Tools.' There, the documentation, too, should do as much explaining as possible about how to use these search and scrape routines.
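
For instance, the network-retrieval step can fail with a checked java.io.IOException, so a defensive download looks something like the sketch below (the exception handling shown is just one reasonable choice, not a library requirement):

 Vector<HTMLNode> page = null;
 try
     { page = HTMLPage.getPageTokens(new java.net.URL("http://some.url.com"), false); }
 catch (java.io.IOException ioe)
     { System.err.println("Download failed: " + ioe.getMessage()); }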

This package powers the websites:

Chinese News Board
Spanish News Board


It was developed on:

Google Cloud Server, using the "Cloud Shell" Theia interface.
Yes, I have electrodes in my eye-sockets and my ears; like many, many Americans, I am a slave, and I hate it. But I (along with the people hypno-programming me) have written this stuff "together" with my master. It sucks; read on.

The best way to become familiar with these routines and Java packages is to download some web-pages written in the "Hyper Text Markup Language" and save them as Java Vectors. The class that accomplishes this, the 'primary' Java downloader class, is: HTMLPage. Below are a few basic example uses of this class.

Example:
 // Download and Parse the HTML on a web-site
 Vector<HTMLNode> webPage = 
      HTMLPage.getPageTokens(new java.net.URL("http://some.url.com"), false);
 
 // Save the HTML to a file:
 FileRW.writeFile(Util.pageToString(webPage), "MyFile.html");
 
 // Print out the HTML <A> (Anchor Links):
 for (TagNode tn : TagNodeGet.all(webPage, TC.OpeningTags, "a"))
      System.out.println(tn.str);

 // Find and print any text-node containing a search string
 for (HTMLNode n : webPage)
      if (n.isTextNode() && n.str.contains("My Search Text"))
          System.out.println(n.str);