Package Torello.HTML

At the core of the HTML Java Utility Package are three HTMLNode classes: the abstract parent HTMLNode and its subclasses TagNode and TextNode. These classes are extremely light-weight, since each has at most three fields. The code is kept open (visible) and well-documented using the JavaDoc Code Documentation Upgrade Tool. The primary purpose of this entire HTML downloading, scraping, and searching package is to provide a way of converting the, sort-of if-you-will, "raw HTML" into more usable Java Vector<HTMLNode> objects. These search and scrape routines do not concern themselves too much with validating HTML, although checking for HTML validity using whatever means you wish should be extremely easy. Going "Beyond the Browser Wars" usually means that the vast majority of public web-sites largely contain valid HTML generated by HTML Generation Tools. The real goal is to reuse, copy, modify, or even extract data from these websites (particularly foreign-news websites).
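
To make that conversion concrete, here is a minimal sketch of the parse step. It assumes getPageTokens also accepts an in-memory HTML string (the URL-based variant is shown at the bottom of this page), and the node breakdown in the comment is illustrative rather than exact:

 Vector<HTMLNode> nodes = HTMLPage.getPageTokens("<p>Hello, <b>World</b></p>", false);

 // Expect alternating nodes, roughly: TagNode("<p>"), TextNode("Hello, "),
 // TagNode("<b>"), TextNode("World"), TagNode("</b>"), TagNode("</p>")
 for (HTMLNode n : nodes)
      System.out.println(n.getClass().getSimpleName() + ":\t" + n.str);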

The package Torello.HTML.NodeSearch is, at its core, a small collection of nearly-identical Java for-loops that allow a person to "stop worrying" about the end-point loop checks which sit at the heart of so many programming projects. The search-loops are all available to read in the NodeSearch package. Re-typing such loops by hand is *extremely* error-prone, which is the real benefit of this JAR Library.
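
As an illustration, the hand-written loop below finds every opening <A> tag; it is exactly the kind of loop NodeSearch encapsulates. This is only a sketch: the TagNode fields 'tok' (the tag name) and 'isClosing' used here are assumptions about the class layout, and 'webPage' is a parsed Vector<HTMLNode>, as produced in the download example at the bottom of this page.

 // Hand-rolled search loop (field names 'tok' and 'isClosing' are assumed):
 Vector<TagNode> anchors = new Vector<>();
 for (HTMLNode n : webPage)
      if (n instanceof TagNode)
      {
          TagNode tn = (TagNode) n;
          if (tn.tok.equals("a") && ! tn.isClosing) anchors.add(tn);
      }

 // The equivalent single call using the NodeSearch package:
 Vector<TagNode> anchors2 = TagNodeGet.all(webPage, TC.OpeningTags, "a");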

HTML is dealt with using three primary classes that inherit from public abstract class HTMLNode. They are public class TagNode, and also TextNode and CommentNode. There are a few 'extra' classes that may seem to slightly complicate things: 'TagNodeIndex', 'TextNodeIndex', 'CommentNodeIndex' and also 'SubSection'. Although they have "complicated sounding names," what they actually help achieve is a provision for returning a node plus its index in a "single return data class." This (occasionally) makes some search operations easier, where "multiple return values" would be difficult.
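
The idea behind these node-plus-index classes is simple enough to sketch in a few lines. The class below is purely illustrative, not the library's actual definition:

 // Conceptual sketch of a "single return data class" (illustrative only):
 // a search routine can return both the node it found and where it found it.
 class NodePlusIndex
 {
     final int      index;   // position inside the Vector<HTMLNode>
     final HTMLNode n;       // the node located at that position

     NodePlusIndex(int index, HTMLNode n)
     { this.index = index; this.n = n; }
 }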

The primary impetus for writing these packages was to scrape HTML from web-sites whose content comes in from "overseas." In this "Internet Age," nothing could "feel more useless" - or, literally, "be more useless" - than reading the tripe and drivel of a local newspaper telling us about the heroes at the fire-department or the police-department jumping into burning buildings to give us that security we love and cherish so much. What a bunch of bunk! It causes such corruption in the (former) United States, and makes the lives of the people who really still do live in the (former) United States so much worse. Rather than beginning a long-winded diatribe about how awful the American Government has been, it can be extremely enjoyable, even a lot of fun, to try to learn, read, and translate stories from all around the world. Why did we even invent an internet in the first place? To sit around and read stories about the Dunkin' Donuts down the street? Please!

There are as many error checks built into this code as can be provided, and these error-checks are reported as exceptions. Read more of the Java Documentation to find out about these exceptions. And finally... the public class 'PageStats', if you look closely, should provide as much example information as possible about what the search subroutines in this scrape package actually accomplish. There is also a "Work Book" class called public class 'Elements' in the package 'Tools.' There, the documentation, too, should do as much explaining as possible about how to use these search and scrape routines.
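
For instance, the network-retrieval step can fail with a checked java.io.IOException, so a defensive download looks something like the sketch below (the exception handling shown is just one reasonable choice, not a library requirement):

 Vector<HTMLNode> page = null;
 try
     { page = HTMLPage.getPageTokens(new java.net.URL("http://some.url.com"), false); }
 catch (java.io.IOException ioe)
     { System.err.println("Download failed: " + ioe.getMessage()); }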

This package powers the websites:

Chinese News Board
Spanish News Board


It was developed on:

Google Cloud Server, using the "Cloud Shell" Theia interface.
Yes, I have electrodes in my eye-sockets and my ears; like many, many Americans, I am a slave, and I hate it. But I (along with the people hypno-programming me) have written this stuff "together" with my master. It sucks; read on.

The best way to become familiar with these routines and Java packages is to download some web-pages written in the "Hyper Text Markup Language" and save them as Java Vectors. The class that accomplishes this, the 'primary' Java downloader class, is: HTMLPage. Below are a few basic example uses of this class.

Example:
 // Download and Parse the HTML on a web-site
 Vector<HTMLNode> webPage = 
      HTMLPage.getPageTokens(new java.net.URL("http://some.url.com"), false);
 
 // Save the HTML to a file:
 FileRW.writeFile(Util.pageToString(webPage), "MyFile.html");
 
 // Print out the HTML <A> (Anchor Links):
 for (TagNode tn : TagNodeGet.all(webPage, TC.OpeningTags, "a"))
      System.out.println(tn.str);

 // Find and print any text-node containing a search string
 for (HTMLNode n : webPage)
      if (n.isTextNode() && n.str.contains("My Search Text"))
          System.out.println(n.str);