Package Torello.HTML
The core of the HTML Java Utility Package is three HTMLNode subclasses: TagNode, TextNode and
CommentNode. These classes are extremely light-weight since they have at most three fields. The
code is kept open (visible) and well-documented using the JavaDoc Code Documentation Upgrade Tool.
The primary use of this entire HTML downloading, scraping and searching package is to provide a
way of converting "raw HTML" into more usable Java Vector<HTMLNode> lists and Java Objects.
These search and scrape routines do not concern themselves much with validating HTML, although
checking for HTML validity using whatever means you wish should be extremely easy. Going "Beyond
the Browser Wars" usually means that the vast majority of public web-sites largely contain valid
HTML generated by HTML Generation Tools. The real goal is to either reuse, copy, modify, or even
extract data from these web-sites (particularly foreign-news web-sites).
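To make the idea of turning "raw HTML" into a Vector of nodes concrete, here is a minimal sketch.
It is not this library's parser: the classes Node, Tag, Text and Comment, the TOKEN pattern and
the tokenize method are all hypothetical stand-ins invented for this illustration, and the regular
expression is far simpler than anything a real parser would need.

```java
import java.util.Vector;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenizeSketch {
    // Hypothetical stand-ins for the node classes; NOT the real Torello.HTML types.
    static abstract class Node {
        final String str;
        Node(String str) { this.str = str; }
    }
    static final class Tag extends Node { Tag(String s) { super(s); } }
    static final class Text extends Node { Text(String s) { super(s); } }
    static final class Comment extends Node { Comment(String s) { super(s); } }

    // Matches an HTML comment or a tag; whatever lies between two matches is text.
    // Assumption: a deliberately naive pattern (it ignores '>' inside attribute values).
    private static final Pattern TOKEN =
        Pattern.compile("<!--.*?-->|<[^>]+>", Pattern.DOTALL);

    static Vector<Node> tokenize(String html) {
        Vector<Node> out = new Vector<>();
        Matcher m = TOKEN.matcher(html);
        int pos = 0;
        while (m.find()) {
            // Text sitting between the previous token and this one
            if (m.start() > pos) out.add(new Text(html.substring(pos, m.start())));
            String tok = m.group();
            out.add(tok.startsWith("<!--") ? new Comment(tok) : new Tag(tok));
            pos = m.end();
        }
        // Trailing text after the last tag or comment
        if (pos < html.length()) out.add(new Text(html.substring(pos)));
        return out;
    }

    public static void main(String[] args) {
        for (Node n : tokenize("<p>Hello <!-- note --><b>world</b></p>"))
            System.out.println(n.getClass().getSimpleName() + ": " + n.str);
    }
}
```

Each node in the sketch carries only its source text, which mirrors the "light-weight" design
described above. Once a page is a flat Vector, searching it is plain index arithmetic.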
The package Torello.HTML.NodeSearch, at its core, is a small collection of nearly-identical Java
for-loops that allow a person to "stop worrying" about the end-point for-loop checks which are
often at the core of any computer-science programming project. The search-loops are all available
to read in the NodeSearch package. Re-typing such loops is *extremely* error-prone, and avoiding
that re-typing is the real benefit of this JAR Library.
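The kind of loop the paragraph above refers to might look like the following sketch. It works on
a Vector<String> rather than on real nodes, and the method name indicesOf, along with the
convention that a negative end means "to the end of the vector," are assumptions made for this
example rather than the NodeSearch API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Vector;

public class SearchLoopSketch {
    // Returns every index in [start, end) whose element contains 'needle'.
    // Assumed convention for this sketch: a negative 'end' means "up to page.size()".
    static List<Integer> indicesOf(Vector<String> page, int start, int end, String needle) {
        if (end < 0) end = page.size();

        // The end-point checks that are easy to get wrong when re-typed by hand
        if (start < 0 || start > end || end > page.size())
            throw new IndexOutOfBoundsException(
                "start=" + start + ", end=" + end + ", size=" + page.size());

        List<Integer> hits = new ArrayList<>();
        for (int i = start; i < end; i++)
            if (page.get(i).contains(needle)) hits.add(i);
        return hits;
    }

    public static void main(String[] args) {
        Vector<String> page = new Vector<>(List.of("<a href='x'>", "link", "</a>", "<a>"));
        System.out.println(indicesOf(page, 0, -1, "<a"));
    }
}
```

Writing the range validation once and reusing it is the whole point: every search variant shares
the same bounds logic instead of repeating it.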
HTML is dealt with using three primary classes that inherit from public abstract class HTMLNode.
They are public class TagNode, TextNode and CommentNode. There are a few 'extra' classes that may
seem to slightly complicate things: 'TagNodeIndex', 'TextNodeIndex', 'CommentNodeIndex' and also
'SubSection'. Although they have "complicated sounding names," what they actually help achieve is
a provision for returning a Node-plus-Index in a "single return data class." This (occasionally)
makes some search operations easier, where "multiple return values" would be difficult.
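The "Node-plus-Index" idea can be sketched in a few lines. The generic class Indexed and the
method first below are hypothetical names chosen for this example, and plain Strings stand in for
nodes; the sketch only shows why bundling the two values into one return object is convenient.

```java
import java.util.List;
import java.util.Vector;

public class NodeIndexSketch {
    // Hypothetical (index, node) pair; generic so one class can serve any node type.
    static final class Indexed<N> {
        final int index;
        final N node;
        Indexed(int index, N node) { this.index = index; this.node = node; }
    }

    // Returns the first element starting with 'prefix' together with its position,
    // or null when nothing matches.
    static Indexed<String> first(Vector<String> page, String prefix) {
        for (int i = 0; i < page.size(); i++)
            if (page.get(i).startsWith(prefix)) return new Indexed<>(i, page.get(i));
        return null;
    }

    public static void main(String[] args) {
        Vector<String> page = new Vector<>(List.of("<p>", "<b>", "bold", "</b>", "</p>"));
        Indexed<String> hit = first(page, "<b");
        System.out.println(hit.index + " -> " + hit.node);
    }
}
```

Without such a class, a caller who needs both the node and its position would have to search
twice or pass in a mutable holder.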
The primary impetus for writing these packages was to scrape HTML from web-sites that have content coming in from "overseas." In this "Internet Age" nothing could "feel more useless" - or literally - "be more useless" than reading the tripe and the drivel of a local newspaper telling us about the heroes at the fire-department or the police-department jumping into burning buildings to give us that security that we love and cherish so much. What a bunch of bunk! It causes such corruption in the (former) United States, and makes the lives of the people that really still do live in the (former) United States so much worse. Rather than beginning a long-winded diatribe about how awful the American Government has been, it can be extremely enjoyable, even a lot of fun to try and learn, read, and translate stories from all around the world. Why did we even invent an internet in the first place? To sit around and read stories about the Dunkin' Donuts down the street? Please!
There are as many error checks built into this code as can be provided, and these error-checks are
included as "exceptions." Read more of the Java Documentation to find out about these exceptions.
And finally, the public class 'PageStats', if you look closely, should provide as much example
information as is possible about what the search subroutines in this scrape package actually
accomplish. There is also a "Work Book" class called public class 'Elements' in the package
'Tools.' There, the documentation, too, should do as much explaining as is possible about how to
use these search and scrape routines.
This package powers the websites:
It was developed on:
Google Cloud Server, using the "Cloud Shell" Theia interface.
Yes, I have electrodes in my eye-sockets, and my ears, like many many Americans, I am slave, and I hate it. But, I (along with the people hypno-programming me) have written this stuff "Together" with my master. It sucks, read on.
The best way to become familiar with these routines and Java Packages is to download some
web-pages written in the "Hyper Text Markup Language" and save them as Java Vectors. The class
that accomplishes this, the 'primary' Java downloader class, is HTMLPage. Below are a few basic
example uses of this class.
Example:
    // Download and Parse the HTML on a web-site
    Vector<HTMLNode> webPage = HTMLPage.getPageTokens(new java.net.URL("http://some.url.com"), false);

    // Save the HTML to a file:
    FileRW.writeFile(Util.pageToString(webPage), "MyFile.html");

    // Print out the HTML <A> (Anchor Links):
    for (TagNode tn : TagNodeGet.all(webPage, TC.OpeningTags, "a"))
        System.out.println(tn.str);

    // Find and print text
    for (HTMLNode n : webPage)
        if (n.isTextNode())
            if (n.str.contains("My Search Text"))
                System.out.println(n.str);
Interfaces

| Java Entity | Description |
| --- | --- |
| Attributes.Filter | Lambda-target for creating attribute-filters |
| ReplaceFunction | A function-pointer definition that facilitates the substituting of HTMLNode elements in Vectorized-HTML with other, user-provided, elements |
| URLFilter | A simple lambda-target which extends Predicate<URL> |
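A lambda-target that extends Predicate<URL> can be sketched as below. The interface UrlFilter and
the helpers hostEndsWith and url are hypothetical stand-ins written for this example; only
java.util.function.Predicate, java.net.URI and java.net.URL are real JDK types.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.function.Predicate;

public class UrlFilterSketch {
    // Hypothetical stand-in: a functional interface that simply extends Predicate<URL>,
    // so any lambda taking a URL and returning a boolean can be assigned to it.
    interface UrlFilter extends Predicate<URL> { }

    // Builds a filter that keeps only URLs whose host ends with the given suffix.
    static UrlFilter hostEndsWith(String suffix) {
        return u -> u.getHost().endsWith(suffix);
    }

    // Small helper that converts the checked exception into an unchecked one.
    static URL url(String s) {
        try { return URI.create(s).toURL(); }
        catch (MalformedURLException e) { throw new IllegalArgumentException(s, e); }
    }

    public static void main(String[] args) {
        // Predicate's default methods (and, or, negate) come along for free.
        Predicate<URL> secureOrg =
            hostEndsWith(".org").and(u -> u.getProtocol().equals("https"));

        System.out.println(secureOrg.test(url("https://example.org/news")));
        System.out.println(secureOrg.test(url("http://example.org/news")));
    }
}
```

Extending Predicate rather than declaring a fresh interface means filters compose with the
standard combinators and can be passed to any API that accepts a Predicate<URL>.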
Core HTML Data Classes

| Java Entity | Description |
| --- | --- |
| HTMLNode | This class is mostly a wrapper for class java.lang.String, and serves as the abstract parent of the three types of HTML elements offered by the Java HTML Library |
| CommentNode | Represents HTML Comments, and is one of only three HTML Element Classes provided by the Java HTML Library Tool, and also one of the three data-classes that can be generated by the HTML Parser |
| TagNode | Represents an HTML Element Tag, and is the flagship class of the Java-HTML Library |
| TextNode | Represents document text, and is one of only three HTML Element Classes provided by the Java HTML Library Tool, and also one of the three classes that can be generated by the HTML Parser |
| DotPair | A simple utility class, used ubiquitously throughout Java HTML, which maintains two integer fields, DotPair.start and DotPair.end, for demarcating the beginning and ending of a sub-list within an HTML web-page |
| PageStats | Computes miscellaneous statistics for a web-page, or sub-page |

Replaceable, Efficient Transform

| Java Entity | Description |
| --- | --- |
| Replaceable | The class ReplaceNodes offers a great efficiency-improvement optimization for modifying vectorized-HTML |
| NodeIndex<NODE extends HTMLNode> | The abstract parent class of all three NodeIndex classes, TagNodeIndex, TextNodeIndex and CommentNodeIndex |
| CommentNodeIndex | This is a simple data-class whose primary reason for development was to provide a way for the NodeSearch Package classes to simultaneously return both a CommentNode instance, and a Vector-index location (for that node) - at the same time - when searching HTML web-pages for HTML comments |
| TagNodeIndex | This is a simple data-class whose primary reason for development was to provide a way for the NodeSearch Package classes to simultaneously return both a TagNode instance, and a Vector-index location (for that node) - at the same time - when searching HTML web-pages for HTML tags |
| TextNodeIndex | This is a simple data-class whose primary reason for development was to provide a way for the NodeSearch Package classes to simultaneously return both a TextNode instance, and a Vector-index location (for that node) - at the same time - when searching HTML web-pages for document-text |
| SubSection | Allows the NodeSearch Package to simultaneously return both an HTML-Vector sublist, and the location (as an instance of DotPair) where that sub-list was located |

Parse & Scrape Classes

| Java Entity | Description |
| --- | --- |
| HTMLPage | Java HTML's flagship-parser class for converting HTML web-pages into plain Java Vector's of HTMLNode |
| HTMLPageMWT | A carbon-copy of class HTMLPage, augmented with a mechanism for setting a timeout so that when scraping web-pages and URL's from servers that might have a tendency to hang, freeze, or delay - the Java Virtual Machine can skip and move-on when that timeout expires |
| HTMLPage.Parser | A function-pointer / lambda-target that (could) potentially be used to replace this library's current regular-expression based parser with something possibly faster or even more efficient |
| Scrape | Some standard utilities for transferring & downloading HTML from web-sites and then storing that content in memory as a Java String - which, subsequently, can be written to disk, transferred elsewhere, or even parsed (using class HTMLPage) |
| HTMLTags | Primary "HTML-5 Tags" class - keeps a list of all 122 Tags in a TreeSet<String>, and many accessor methods that are used by the HTML Parser, or potentially any class or function that may need this list |
| HTTPCodes | Keeps lists of the various HTTP Response Codes, made available as String-arrays and Iterator's for tasks such as building better exception messages or printing for use as reference |

HTML Processing-Classes

| Java Entity | Description |
| --- | --- |
| Attributes | Utilities for getting, setting and removing attributes from the TagNode elements in a Web-Page Vector |
| Balance | Utilities for checking that opening and closing TagNode elements match up (that the HTML is balanced) |
| Debug | A fast way to print Web-Page Vector's to a String |
| DPUtil | |
| Escape | Easy utilities for escaping and un-escaping HTML characters, and even code-point based Emoji's |
| Features | Tools to retrieve and insert tags into the <HEAD> of a web-page |
| Features.Meta | Tools made specifically for the <META> tags in the <HEAD> of a web-page |
| InnerTags | "Inner-Tags", a synonym for "Attributes", allows a user to do some aggregate searches for the types of attributes in Vectorized-HTML |
| Links | Utilities for de-referencing 'partially-completed' URL's in a Web-Page Vector |
| Listeners | A basic tool for finding Java-Script Listener Attributes in the TagNode elements in a Vectorized-HTML Web-Page |
| ReplaceNodes | Methods for quickly & efficiently replacing the nodes on a Web-Page |
| SplashBridge | Demonstrates using 'Splash,' which is one of many ways to execute the Java-Script on Web-Pages, before those pages are parsed |
| Surrounding | Class for finding ancestor & parent nodes of any selected HTMLNode |
| EXPORT_PORTAL | |
| Util | A long list of utilities for searching, finding, extracting and removing HTML from Vectorized-HTML |
| Util.Count | |
| Util.Remove | |
| Util.Inclusive | Tools for finding the matching-closing tag of any open TagNode |
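As an illustration of what "checking that opening and closing tags match up" can mean, the sketch
below counts opening and closing occurrences per tag name. This is an assumption about one simple
way to do it, not a description of how class Balance works; the names balance and isBalanced are
invented here, tags are plain Strings, and void elements such as <br> are not special-cased.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class BalanceSketch {
    // Maps each tag name to (number of opening tags - number of closing tags).
    // A value of 0 for every name means the counts match up.
    static Map<String, Integer> balance(List<String> tags) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String t : tags) {
            boolean closing = t.startsWith("</");
            // Drop '<' (and '/' for closing tags), then cut at the first space, '/' or '>'
            String name = t.substring(closing ? 2 : 1).split("[\\s/>]", 2)[0].toLowerCase();
            counts.merge(name, closing ? -1 : 1, Integer::sum);
        }
        return counts;
    }

    static boolean isBalanced(List<String> tags) {
        return balance(tags).values().stream().allMatch(c -> c == 0);
    }

    public static void main(String[] args) {
        List<String> tags = List.of("<div class='x'>", "<span>", "</div>");
        System.out.println(balance(tags));
        System.out.println(isBalanced(tags));
    }
}
```

Counting does not detect badly nested tags such as <b><i></b></i>; a stack-based check would be
needed for that, at the cost of more bookkeeping.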
Enums

| Java Entity | Description |
| --- | --- |
| AUM | AUM (Attribute Update Mode) - Documentation |
| MetaTagName | The values & types of HTML 'NAME' <META>-Tags |
| Robots | An enumeration of the standard Search Engine index-directives |
| SD | Single-Quotes & Double-Quotes - An enumeration of the two types of quotes used inside HTML Tags for assigning values to tag-attributes |
| TC | Tag Criteria - An enumeration that allows a user to specify which type of tags are being requested when searching a web-page for HTML-Tags |

Exceptions

| Java Entity | Description |
| --- | --- |
| ClosingTagNodeException | This throws when attempting to write attributes to a closing TagNode (because closing tags may not have attributes). Valid HTML does not allow "Inner Tags" or "Attributes" to be inserted into HTML Elements that begin with the forward-slash (ASCII Character: '/') |
| HTMLTokException | Used when an invalid, non-HTML, tag or token has been passed to a method |
| MalformedTagNodeException | If, when attempting to instantiate or construct a TagNode, the String used to instantiate that node is invalid, this exception will be thrown to inform the programmer that his passed constructor-String was invalid |
| QuotesException | This Exception is generated, usually, when a quote-within-quote problem has occurred inside HTML Attributes |
| ScrapeException | Thrown by class Scrape when a scrape-download fails |
| SingletonException | An exception that's mostly identical to the class InclusiveException, but is thrown only when attempting to instantiate a Singleton HTML Element with a closing-tag forward-slash |
| MalformedHTMLException | This Exception may be thrown by code that checks the validity of an HTML Page Vector |
| BalancedHTMLException | Used to report unbalanced HTML Tags within any Vectorized-HTML page or sub-page |
| HREFException | Used to identify problems parsing or searching an 'HREF' attribute from an HTML '<A HREF=...>' (Anchor) Tag or any Tag that is expected to contain an 'HREF' attribute |
| SRCException | Used to identify problems parsing or searching a 'SRC' attribute from an HTML '<IMG SRC=...>' Tag or any Tag that is expected to contain a 'SRC' attribute |
| InnerTagKeyException | This occurs whenever a parameter specifies an Inner-Tag "Key-Value Pair" (which, in this package, are also known as Attribute-Value Pairs) that contain inappropriate characters inside the key-String |
| InnerTagValueException | This class is not used internally, but is intended to be used to check for invalid attribute-values inside HTML Tags |
| ReplaceableException | In order for a set or collection of Replaceable's to work properly during an update or replacement, the set or collection (Iterable<Replaceable>) of Replaceable's must be properly sorted and the pieces may not overlap (as per their original-locations) |
| ReplaceableOutOfBoundsException | Thrown during a Replaceable HTML-Vector update operation when a Replaceable instance is not within the bounds of that HTML-Vector |
| ReplaceablesOverlappingException | Thrown during a Replaceable HTML-Vector update operation when two consecutive Replaceable instances appear to point to overlapping indices within the Vector |
| ReplaceablesUnsortedException | Thrown during a Replaceable HTML-Vector update operation when two consecutive Replaceable instances appear to be out of order |
| NodeExpectedException | This can be used to catch different types of exceptions using the same code-branch, since it is the parent-class of several 'Node Expected Exceptions' |
| NodeNotFoundException | If a programmer is writing code, and expecting an HTML-Page Vector position to contain a specific type of HTMLNode, and it is not found anywhere on that page, sub-page or sub-section, then this exception may be used |
| TagNodeExpectedException | If an HTML-Page Vector index-position is expected to contain a TagNode but it turns out to be an instanceof TextNode or, possibly, CommentNode - then this exception should throw |
| TextNodeExpectedException | If an HTML-Page Vector index-position is expected to contain a TextNode but it turns out to be an instanceof TagNode or, possibly, CommentNode - then this exception should throw |
| OpeningTagNodeExpectedException | If an HTML-Page Vector index-position should contain a TagNode whose TagNode.isClosing field is set FALSE, but that field is TRUE, then this exception should throw |
| ClosingTagNodeExpectedException | If a programmer is expecting an HTML-Page Vector position-index to contain a TagNode whose TagNode.isClosing field is set to TRUE and it is not, then this exception should be thrown |
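The three conditions behind the Replaceable exceptions listed above (in bounds, sorted, not
overlapping) can be checked in a single pass. The sketch below is an assumption about how such a
validation might look: the class Range, the method check and the use of half-open [start, end)
ranges are all invented for this example, and standard JDK exceptions stand in for the library's
own exception classes.

```java
import java.util.List;

public class ReplaceableCheckSketch {
    // Hypothetical half-open range [start, end) marking where a piece originally
    // sat inside the page vector.
    static final class Range {
        final int start, end;
        Range(int start, int end) { this.start = start; this.end = end; }
    }

    // Throws if any range is out of bounds, or if two consecutive ranges are
    // out of order or overlapping. Returns normally when the list is usable.
    static void check(List<Range> ranges, int vectorSize) {
        Range prev = null;
        for (Range r : ranges) {
            if (r.start < 0 || r.start > r.end || r.end > vectorSize)
                throw new IndexOutOfBoundsException(
                    "[" + r.start + ", " + r.end + ") is outside a vector of size " + vectorSize);
            if (prev != null) {
                if (r.start < prev.start)
                    throw new IllegalStateException("ranges are not sorted by start index");
                if (r.start < prev.end)
                    throw new IllegalStateException("consecutive ranges overlap");
            }
            prev = r;
        }
    }

    public static void main(String[] args) {
        check(List.of(new Range(0, 2), new Range(2, 5)), 10);
        System.out.println("ranges are valid");
    }
}
```

Validating up front lets the update itself assume clean input, which is what makes a single
left-to-right replacement pass possible.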