Package Torello.HTML
Class Links
- java.lang.Object
  - Torello.HTML.Links
public class Links extends java.lang.Object
Utilities for de-referencing 'partially-completed' URLs in a Web-Page Vector.
This utility class 'completes' URLs that are scraped from web-pages and are 'relative' (partially completed). Relative URLs are common on the web, because a page author does not need to spell out an entire directory and web-server DNS name to retrieve an image file or link that resides in the same directory as the web-page in which that link appears.
Content Note:
These scrape-package classes were initially developed for scraping news-content from the Chinese Government Web-Portal, and redirecting over-seas news-content to a simple translation service for people interested in reading about news from over-seas. This is particularly interesting for a country such as China, where a huge percentage of our economic GDP is based on products exported from factories in the Southern Region there to our strip-malls here in Dallas (and other places). Perhaps these URL examples may not seem relevant to a typical Internet-Programmer who is not presently studying languages, but they are staying here anyway.
Specifically: In addition to Java, the languages Chinese, Spanish, German, etc. are also interesting to study.
Exception Suppression:
Precisely half of these methods are designed to "sweep" an entire page of HTML. The methods that expect a Vector of anchors, images, or other links, and iterate over the entire HTML-Vector or page, will catch any and all exception-throws of type MalformedURLException, and place null in the return-Vector position for that particular URL.
The value of this is, of course, that every link that can be resolved will be resolved. Checking the return-Vector for null-values is necessary whenever identifying broken links or image-sources on a page is important. However, each method whose name ends with the suffix '_KE' shall return a Vector that includes any thrown exception, using the Java-HTML Tuple-Class Ret2<URL, MalformedURLException>.
This concept may seem unusual, but once the process is familiar, the value of not being forced to write try-catch blocks for every web-page URL-resolution stage in your programs will hopefully become obvious.
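The two styles described above can be sketched with JDK classes only. This is a minimal illustration, not the library's implementation: the class names `SuppressDemo` and `Pair` are hypothetical (`Pair` merely stands in for `Ret2`), and the JDK's `java.net.URL` constructor is assumed as the resolver.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

@SuppressWarnings("deprecation") // the URL constructors are deprecated in recent JDKs
class SuppressDemo {
    // "Sweep" style: a failed resolution becomes a null entry, and the loop keeps going.
    static List<URL> sweep(List<String> links, URL sourcePage) {
        List<URL> ret = new ArrayList<>();
        for (String link : links) {
            try { ret.add(new URL(sourcePage, link)); }
            catch (MalformedURLException e) { ret.add(null); }
        }
        return ret;
    }

    // "Keep Exception" style: each entry holds either the URL or the exception that was caught.
    static List<Pair<URL, MalformedURLException>> sweepKE(List<String> links, URL sourcePage) {
        List<Pair<URL, MalformedURLException>> ret = new ArrayList<>();
        for (String link : links) {
            try { ret.add(new Pair<>(new URL(sourcePage, link), null)); }
            catch (MalformedURLException e) { ret.add(new Pair<>(null, e)); }
        }
        return ret;
    }

    public static void main(String[] args) throws MalformedURLException {
        URL page = new URL("http://example.com/news/index.html");
        List<String> links = List.of("image12345.png", "tel:2125556789");

        for (URL u : sweep(links, page)) System.out.println(u);

        for (Pair<URL, MalformedURLException> p : sweepKE(links, page))
            System.out.println(p.a + " / " + (p.b == null ? "no exception" : p.b.getMessage()));
    }
}

// Hypothetical stand-in for the library's Ret2 tuple: 'a' holds the URL, 'b' the exception.
final class Pair<A, B> {
    final A a;
    final B b;
    Pair(A a, B b) { this.a = a; this.b = b; }
}
```

With the sweep style, the caller only has to test entries for null; with the keep-exception style, the reason for each failure remains available without a try-catch block at the call site.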
Example Table:
The following table attempts to explain the rules for evaluating relative / partial URL's, such as an HTML '<A ...>' (Anchor-Tag) 'HREF=...' URL, or an <IMG SRC="..."> URL. The column on the left portrays the type of TagNode-input containing a URL - which could be a partial URL - while the column on the right hopefully demystifies how such a URL would be "decoded" (de-referenced) from a partial to a complete Uniform Resource Locator.

HTML TagNode | sourceURL: http://english.gov.CN/article/01-01-2018/index.html
<IMG SRC="http://english.gov.CN/article/01-01-2018/image12345.bmp"> | http://english.gov.CN/article/01-01-2018/image12345.bmp
<IMG SRC="/article/01-01-2018/image12345.bmp"> | http://english.gov.CN/article/01-01-2018/image12345.bmp
<IMG SRC="image12345.bmp"> | http://english.gov.CN/article/01-01-2018/image12345.bmp
<IMG SRC="//some.other.url/a.bmp"> | http://some.other.url/a.bmp
<A HREF="#sub-section"> | null
<IMG SRC="../../pic2.bmp"> | http://english.gov.CN/pic2.bmp
<A HREF="tel: (212) 555-6789"> | null

HTML TagNode | sourceURL: http://english.gov.CN/article/12-31-2018/index.html
<IMG SRC="http://english.gov.CN/article/01-01-2018/image12345.png"> | http://english.gov.CN/article/01-01-2018/image12345.png
<IMG SRC="/article/01-01-2018/image12345.png"> | http://english.gov.CN/article/01-01-2018/image12345.png
<IMG SRC="image12345.png"> | http://english.gov.CN/article/12-31-2018/image12345.png
<IMG SRC="//some.other.url/a.bmp"> | http://some.other.url/a.bmp
<A HREF="#sub-section"> | null
<IMG SRC="../pic3.bmp"> | http://english.gov.CN/article/pic3.bmp
<A HREF="mailto: [email protected]"> | null

HTML TagNode | sourceURL: http://SpanishNewsBoard.com/article/10-12-2018/index.html
<IMG SRC="http://english.gov.CN/article/01-01-2018/image12345.jpg"> | http://english.gov.CN/article/01-01-2018/image12345.jpg
<IMG SRC="/article/01-01-2018/image12345.jpg"> | http://SpanishNewsBoard.com/article/01-01-2018/image12345.jpg
<IMG SRC="image12345.jpg"> | http://SpanishNewsBoard.com/article/10-12-2018/image12345.jpg
<IMG SRC="//some.other.url/a.bmp"> | http://some.other.url/a.bmp
<A HREF="#sub-section"> | null
<IMG SRC="../../../pic3.bmp"> | null
<A HREF="javascript: alert("hello world);"> | null
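Several rows of the first table can be reproduced with the JDK alone. The sketch below assumes `java.net.URI.resolve` as a stand-in for this class's `resolve` method; the class name `RelativeUrlDemo` is hypothetical, and the library's own logic may differ in detail.

```java
import java.net.URI;

class RelativeUrlDemo {
    // Resolves 'ref' against 'sourcePage' using the JDK's standard reference resolution.
    static String resolve(String sourcePage, String ref) {
        return URI.create(sourcePage).resolve(ref).toString();
    }

    public static void main(String[] args) {
        String page = "http://english.gov.CN/article/01-01-2018/index.html";

        String[] refs = {
            "http://english.gov.CN/article/01-01-2018/image12345.bmp", // already absolute
            "/article/01-01-2018/image12345.bmp",                      // host-relative
            "image12345.bmp",                                          // directory-relative
            "//some.other.url/a.bmp",                                  // protocol-relative
            "../../pic2.bmp"                                           // parent directories
        };

        for (String ref : refs) System.out.println(ref + " -> " + resolve(page, ref));
    }
}
```

The rows that the table marks as null (fragments, `tel:`, `javascript:` and so on) are specific to this class. The plain JDK call does not return null for them; a fragment-only reference, for example, would typically come back as the page URL with the fragment appended.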
The following example will find all HTML <A HREF="..."> (anchor-tags), and replace the HREF value it finds with an absolute url-link.
Example:
    // This fixes the body of a "web-page news-article" (or any web-site html, so to speak)
    // It assures that (after scraping) any original Anchor URL's which contained "relative links"
    // become "absolute links" - by completing the URL.

    // The original web-site url
    URL webSiteURL = new URL("https://some-web-site.com/News/Article-Numero-Uno.html");

    // Here the HTML page is downloaded to a simple Java Vector.
    Vector<HTMLNode> page = HTMLPage.getPageTokens(webSiteURL, false);

    // Any URL's which do not contain complete URI's - inclusive of a domain-name, directory,
    // and file-name will be completed and inserted back into the page.
    Links.resolveAllHREF(page, webSiteURL, SD.SingleQuotes, false);
COMMON SPECIAL CASES:
Some commonly found HREF-Attribute values are URL-Links that are not intended to point to HTML pages. The following values for HTML Anchor-Tag HREF-Attributes will cause this class to return null and/or return an exception:
- <A HREF="tel:<a-telephone-number>" ... >
- <A HREF="javascript:<some-script-calls>" ... >
- <A HREF="mailto:<an-email-address>" ... >
- <A HREF="file:<file-for-download>" ... >
- <A HREF="ftp:<ftp-file-transfer-protocol-address>" ... >
- <A HREF="magnet:<bit-torrent-address>" ... >
- <A HREF="data:<base64-encoded-image>" ... >
- <A HREF="blob:<Binary-Large-Object>" ... >
- <A HREF="#<this-page-subsection>" ... >
Any call to resolve an HTML Anchor element whose URL link begins with one of the above special-cases will return null. If the "Keep Exception" (_KE) version is requested, a Torello.Java.Ret2<URL, HREFException> will be returned instead, where the value of ret2.a is null, and the value of ret2.b is an instance of HREFException.
- See Also:
ReplaceNodes, ReplaceFunction, HTMLPage, InnerTagFind, Ret2
Hi-Lited Source-Code: Torello/HTML/Links.java
File Size: 58,204 Bytes Line Count: 1,378 '\n' Characters Found
Stateless Class: This class neither contains any program-state, nor can it be instantiated. The @StaticFunctional Annotation may also be called 'The Spaghetti Report'. Static-Functional classes are, essentially, C-Styled Files, without any constructors or non-static member fields. It is a concept very similar to the Java-Bean's @Stateless Annotation.
- 1 Constructor(s), 1 declared private, zero-argument constructor
- 26 Method(s), 26 declared static
- 1 Field(s), 1 declared static, 1 declared final
Field Summary
Fields
Modifier and Type | Field
protected static String[] | _NON_URL_HREFS
Method Summary
Resolve URL's
Modifier and Type | Method
static URL | resolve(String src, URL sourcePage)
static Vector<URL> | resolve(Vector<String> src, URL sourcePage)

Resolve URL's, but Suppress Exceptions, and Keep Them
Modifier and Type | Method
static Ret2<URL, MalformedURLException> | resolve_KE(String src, URL sourcePage)
static Vector<Ret2<URL, MalformedURLException>> | resolve_KE(Vector<String> src, URL sourcePage)

Resolve HREF-Attribute URL's
Modifier and Type | Method
static URL | resolveHREF(TagNode tnWithHREF, URL sourcePage)
static TagNode | resolveHREFAndUpdate(TagNode tnWithHREF, URL sourcePage)
static Vector<URL> | resolveHREFs(Iterable<TagNode> tnListWithHREF, URL sourcePage)
static Vector<URL> | resolveHREFs(Vector<? extends HTMLNode> html, int[] nodePosArr, URL sourcePage)

Resolve SRC-Attribute URL's
Modifier and Type | Method
static URL | resolveSRC(TagNode tnWithSRC, URL sourcePage)
static TagNode | resolveSRCAndUpdate(TagNode tnWithSRC, URL sourcePage)
static Vector<URL> | resolveSRCs(Iterable<TagNode> tnListWithSRC, URL sourcePage)
static Vector<URL> | resolveSRCs(Vector<? extends HTMLNode> html, int[] nodePosArr, URL sourcePage)

Resolve all HREF URL's on an HTML-Page, and Update the Page-Vector
Modifier and Type | Method
static Ret3<int[],int[],int[]> | resolveAllHREF(Vector<? super TagNode> html, int sPos, int ePos, URL sourcePage, SD quote, boolean askForReturnArraysOrReturnNull)
static Ret3<int[],int[],int[]> | resolveAllHREF(Vector<? super TagNode> html, URL sourcePage, SD quote, boolean askForReturnArraysOrReturnNull)
static Ret3<int[],int[],int[]> | resolveAllHREF(Vector<? super TagNode> html, DotPair dp, URL sourcePage, SD quote, boolean askForReturnArraysOrReturnNull)

Resolve all SRC URL's on an HTML-Page, and Update the Page-Vector
Modifier and Type | Method
static Ret3<int[],int[],int[]> | resolveAllSRC(Vector<? super TagNode> html, int sPos, int ePos, URL sourcePage, SD quote, boolean askForReturnArraysOrReturnNull)
static Ret3<int[],int[],int[]> | resolveAllSRC(Vector<? super TagNode> html, URL sourcePage, SD quote, boolean askForReturnArraysOrReturnNull)
static Ret3<int[],int[],int[]> | resolveAllSRC(Vector<? super TagNode> html, DotPair dp, URL sourcePage, SD quote, boolean askForReturnArraysOrReturnNull)

Resolve HREF URL's, but Suppress Exceptions, and Keep Them
Modifier and Type | Method
static Ret2<URL, MalformedURLException> | resolveHREF_KE(TagNode tnWithHREF, URL sourcePage)
static Vector<Ret2<URL, MalformedURLException>> | resolveHREFs_KE(Iterable<TagNode> tnListWithHREF, URL sourcePage)
static Vector<Ret2<URL, MalformedURLException>> | resolveHREFs_KE(Vector<? extends HTMLNode> html, int[] nodePosArr, URL sourcePage)

Resolve SRC URL's, but Suppress Exceptions, and Keep Them
Modifier and Type | Method
static Ret2<URL, MalformedURLException> | resolveSRC_KE(TagNode tnWithSRC, URL sourcePage)
static Vector<Ret2<URL, MalformedURLException>> | resolveSRCs_KE(Iterable<TagNode> tnListWithSRC, URL sourcePage)
static Vector<Ret2<URL, MalformedURLException>> | resolveSRCs_KE(Vector<? extends HTMLNode> html, int[] nodePosArr, URL sourcePage)

More Methods
Modifier and Type | Method
static URL | getBaseURL(Vector<? extends HTMLNode> page)
static String[] | NON_URL_HREFS()
Field Detail
-
_NON_URL_HREFS
protected static final java.lang.String[] _NON_URL_HREFS
List of documented "starter-strings" that are sometimes used in Anchor URL 'HREF=...' attributes.
- See Also:
NON_URL_HREFS()
- Code:
- Exact Field Declaration Expression:
protected static final String[] _NON_URL_HREFS = { "tel:", "magnet:", "javascript:", "mailto:", "ftp:", "file:", "data:", "blob:", "#" };
-
-
Method Detail
-
NON_URL_HREFS
public static java.lang.String[] NON_URL_HREFS()
This small method just returns the complete list of commonly found Anchor 'HREF' Strings that do not actually constitute an HTML 'URL'. The method returns a "clone" of an internally stored String[] Array. This ensures that the internal list of HTML Anchor-Tag 'HREF' Attribute prefixes cannot be changed, doctored or modified by the caller.
- Returns:
A clone of the String-array '_NON_URL_HREFS'
- See Also:
_NON_URL_HREFS
- Code:
- Exact Method Body:
return _NON_URL_HREFS.clone();
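The effect of returning a clone can be shown with a small self-contained sketch. The class `NonUrlHrefDemo`, its abbreviated prefix list, and the helper `isNonUrlHref` are all hypothetical; in particular, how the library itself matches these prefixes is an assumption here, not something this page states.

```java
import java.util.Locale;

class NonUrlHrefDemo {
    // Illustrative subset of the prefix list; not the library's actual field.
    private static final String[] PREFIXES = { "tel:", "javascript:", "mailto:", "#" };

    // Returning a clone lets callers modify their copy without touching the internal array.
    static String[] prefixes() { return PREFIXES.clone(); }

    // Hypothetical helper: true if 'href' begins with one of the listed prefixes.
    static boolean isNonUrlHref(String href) {
        String h = href.trim().toLowerCase(Locale.ROOT);
        for (String p : PREFIXES) if (h.startsWith(p)) return true;
        return false;
    }

    public static void main(String[] args) {
        String[] copy = prefixes();
        copy[0] = "changed";                          // only the caller's copy is affected

        System.out.println(prefixes()[0]);            // still the original first entry
        System.out.println(isNonUrlHref("#top"));
        System.out.println(isNonUrlHref("index.html"));
    }
}
```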
-
getBaseURL
public static java.net.URL getBaseURL (java.util.Vector<? extends HTMLNode> page) throws MalformedHTMLException, java.net.MalformedURLException
The methods in this class will not automatically extract any HTML <BASE HREF=URL> definitions that are found on a page. If the user wishes to dereference partial / relative URL definitions that exist on the input page, all the while respecting any <BASE HREF=URL> definition found on that page, then this method should be utilized.
- Parameters:
page - This may be any HTML page or partial page. If this page has a valid HTML <BASE HREF=URL>, it will be extracted and returned as an instance of class URL.
- Returns:
The URL defined by the HTML <BASE HREF="http://..."> element found within the input-page parameter 'page'. If the page provided does not contain a BASE URL definition, then null shall be returned.
The HTML Specification clearly states that only one URL may be defined using the HTML Element <BASE>. Due to the browser wars, unspecified / non-deterministic behavior is possible if multiple definitions are provided. For the purposes of this class, if such a situation arises, an exception is thrown.
- Throws:
MalformedHTMLException - If the HTML page provided contains multiple definitions of the element <BASE HREF=URL>, then this exception will throw.
java.net.MalformedURLException - If a <BASE HREF=URL> is found / identified within the input page, but that URL is invalid, then this exception shall throw.
- See Also:
TagNodeFind, Attributes.retrieve(Vector, int[], String)
- Code:
- Exact Method Body:
int[] posArr = TagNodeFind.all(page, TC.OpeningTags, "base");

if (posArr.length == 0) return null;

// NOTE: The cast is all right because 'posArr' only points to TagNode's
// Attributes expects to avoid processing Vector<TextNode>, and Vector<CommentNode>
// Above, there will be nothing in the 'posArr' if either of those was passed.
@SuppressWarnings("unchecked")
String[] urls = Attributes.retrieve((Vector<HTMLNode>) page, posArr, "href");

boolean found = false;
String ret = null;

for (String url : urls)
    if ((url != null) && (url.length() > 0))
        if (found) throw new MalformedHTMLException(
            "The page you have provided has multiple <BASE HREF=URL> definitions. " +
            "However, the HTML Specifications state that pages may provide just one " +
            "definition. If you wish to proceed, retrieve the definitions manually " +
            "using class TagNodeFind.all and Attributes.retrieve, as explained in " +
            "the JavaDoc pages for this class."
        );
        else { found = true; ret = url; }

// A <BASE> element without a usable HREF leaves 'ret' null. Return null, as documented,
// rather than passing null to the URL constructor.
return (ret == null) ? null : new URL(ret);
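The "at most one definition" rule can be isolated from the HTML classes. The sketch below is an illustration only: `BaseHrefDemo` and `pickBase` are hypothetical names, the input is a plain list of already-extracted HREF values, and `IllegalStateException` stands in for `MalformedHTMLException`.

```java
import java.util.Arrays;
import java.util.List;

class BaseHrefDemo {
    // Returns the single non-empty HREF value, null if there is none,
    // and throws if more than one non-empty value is present.
    static String pickBase(List<String> hrefValues) {
        String ret = null;
        for (String href : hrefValues) {
            if (href == null || href.isEmpty()) continue;   // <BASE> without a usable HREF
            if (ret != null)
                throw new IllegalStateException("multiple <BASE HREF=URL> definitions");
            ret = href;
        }
        return ret;
    }

    public static void main(String[] args) {
        System.out.println(pickBase(Arrays.asList(null, "")));               // no definition
        System.out.println(pickBase(Arrays.asList("http://example.com/")));  // exactly one
    }
}
```

A caller would then pass the returned value (when non-null) as the `sourcePage` argument of the resolve methods, in place of the URL the page was downloaded from.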
-
resolveAllSRC
public static Ret3<int[],int[],int[]> resolveAllSRC (java.util.Vector<? super TagNode> html, java.net.URL sourcePage, SD quote, boolean askForReturnArraysOrReturnNull)
- Code:
- Exact Method Body:
return resolveAllSRC(html, 0, -1, sourcePage, quote, askForReturnArraysOrReturnNull);
-
resolveAllSRC
public static Ret3<int[],int[],int[]> resolveAllSRC (java.util.Vector<? super TagNode> html, DotPair dp, java.net.URL sourcePage, SD quote, boolean askForReturnArraysOrReturnNull)
- Code:
- Exact Method Body:
return resolveAllSRC (html, dp.start, dp.end + 1, sourcePage, quote, askForReturnArraysOrReturnNull);
-
resolveAllSRC
public static Ret3<int[],int[],int[]> resolveAllSRC (java.util.Vector<? super TagNode> html, int sPos, int ePos, java.net.URL sourcePage, SD quote, boolean askForReturnArraysOrReturnNull)
This method shall resolve all partial URL addresses that are found within TagNode elements having 'SRC=...' attributes. Each instance of TagNode found in the input HTML Vector that has a 'SRC' attribute - if the 'URL' is only partially resolved - shall be updated and replaced with a new TagNode with a fully resolved URL.
HTML's <BASE HREF=...>
When calling methods in this class which accept a complete (or partial) HTML Vector (using a parameter such as Vector<HTMLNode>), the user must take care to check whether the page provided has a definition for the HTML Element <BASE HREF=URL>.
If the input page has such a definition, none of the methods in this class will actually heed it (at all), and therefore the user must manually invoke the method getBaseURL(Vector) in order to retrieve that URL, and then pass that result to input-parameter sourcePage.
More recently, HTML-Pages are making less use of the <BASE> HTML-Tag.
- Parameters:
html - This may be any Vectorized-HTML Web-Page (or sub-page).
The Variable-Type Wild-Card Expression '? super TagNode' means that a Vector<TagNode> and a Vector<HTMLNode> are both accepted by this parameter. They will not cause an exception throw.
Note that if a Vector<Object> is passed, and there are no instances of class TagNode contained by that Vector, then this method will simply exit gracefully.
sPos - This is the (integer) Vector-index that sets a limit for the left-most Vector-position to inspect/search inside the input Vector-parameter. This value is considered 'inclusive', meaning that the HTMLNode at this Vector-index will be visited by this method.
If this value is negative, or larger than the length of the input-Vector, an exception will be thrown.
ePos - This is the (integer) Vector-index that sets a limit for the right-most Vector-position to inspect/search inside the input Vector-parameter. This value is considered 'exclusive', meaning that the HTMLNode at this Vector-index will not be visited by this method.
If this value is larger than the size of the input Vector-parameter, an exception will throw.
Passing a negative value to this parameter, 'ePos', will cause its value to be reset to the size of the input Vector-parameter.
sourcePage - This is the source page URL from which the (possibly-relative) URLs of the TagNodes in the HTML-Vector will be resolved.
quote - A choice for the quotes to use. In most cases, URL attribute values do not contain quotation-marks, so either choice would likely work just fine, without exceptions.
null may be passed to this parameter, and if it is, the original quotation marks found in the TagNode's 'SRC' attribute will be reused. Passing null to this parameter should almost always be the easiest and safest choice.
askForReturnArraysOrReturnNull - This (long-named) parameter is merely here to facilitate retrieving more information from this method - if necessary. When this parameter receives the following values:
- TRUE: Three int[] arrays will be returned, as listed in the Returns: section of this method's documentation.
- FALSE: This method shall return null.
- Returns:
If input parameter 'askForReturnArraysOrReturnNull' has been passed FALSE, this method shall return null. Otherwise (if passed TRUE), this method shall return an instance of 'Ret3<int[], int[], int[]>' - three separate integer-arrays describing what was found, and what has occurred.
Keep in mind that though the information might be superfluous, discarding these arrays is easy. They are provided as a matter of convenience for cases where more detailed information is needed to ensure that long lists of HTMLNodes were properly updated.
- Ret3.a (int[])
The first int[] array shall contain the index of every TagNode in the input-Vector parameter's range that contained a non-null HTML 'SRC' Attribute.
- Ret3.b (int[])
The second int[] array will contain the indices of the TagNodes that were replaced by the internal-resolve logic.
- Ret3.c (int[])
The third int[] array will contain the indices of the TagNodes whose 'SRC=...' attribute failed to be resolved by the internal-resolve logic, or caused a QuotesException to throw.
- Throws:
java.lang.IndexOutOfBoundsException - This exception shall be thrown if any of the following are true:
- If 'sPos' is negative, or if sPos is greater-than-or-equal-to the size of the Vector
- If 'ePos' is zero, or greater than the size of the Vector
- If the value of 'sPos' is a larger integer than 'ePos'. If 'ePos' was negative, it is first reset to Vector.size(), before this check is done.
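The range rules listed above can be read as a short validation routine. The following is a sketch of those documented rules only, under the assumption that they are applied in this order; `RangeDemo` and `effectiveEnd` are hypothetical names, and the library's internal check may be organized differently.

```java
class RangeDemo {
    // Returns the effective (exclusive) end index for a Vector of the given size,
    // or throws IndexOutOfBoundsException according to the documented rules.
    static int effectiveEnd(int size, int sPos, int ePos) {
        if (sPos < 0 || sPos >= size)
            throw new IndexOutOfBoundsException("sPos: " + sPos);

        if (ePos < 0) ePos = size;   // a negative ePos means "through the end of the Vector"

        if (ePos == 0 || ePos > size)
            throw new IndexOutOfBoundsException("ePos: " + ePos);

        if (sPos > ePos)
            throw new IndexOutOfBoundsException("sPos " + sPos + " > ePos " + ePos);

        return ePos;
    }

    public static void main(String[] args) {
        System.out.println(effectiveEnd(10, 0, -1));  // whole Vector, as the 4-argument overload requests
        System.out.println(effectiveEnd(10, 2, 5));   // visits indices 2, 3 and 4
    }
}
```

This also explains the overloads shown earlier: the one without positions passes `0, -1` to cover the whole Vector, and the DotPair overload passes `dp.end + 1` because 'ePos' is exclusive.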
- See Also:
resolve(String, URL), TagNode.AV(String), TagNode.setAV(String, String, SD)
- Code:
- Exact Method Body:
// Retrieve the Vector-location of any TagNode on the page that has
// a "SRC=..." attribute. These are almost always HTML <IMG> elements.
// NOTE: FIND Method's are "READ ONLY" - the Cast will make no difference at run-time.
// The @SuppressWarnings is to overcome the cast of 'html'
@SuppressWarnings("unchecked")
int[] hasSrcPosArr = InnerTagFind.all((Vector<HTMLNode>) html, sPos, ePos, "src");

// Java Stream's are convenient for keeping "Growing Lists" of return values.
// This builder shall keep a list of all URL's that failed to update - for any reason
// **UNLESS** the reason is that the URL was already a fully-resolved, non-partial URL
IntStream.Builder failedUpdate = askForReturnArraysOrReturnNull ? IntStream.builder() : null;

// This stream will keep a list of all URL's that were updated, and whose TagNode's
// were replaced inside the input HTML Vector
IntStream.Builder replaced = askForReturnArraysOrReturnNull ? IntStream.builder() : null;

for (int pos : hasSrcPosArr)
{
    // Get the node at the index
    TagNode tn = (TagNode) html.elementAt(pos);

    // 1) Retrieve the SRC Attribute
    // 2) if it is a partial-URL resolve it
    // 3) Convert to a String
    String oldURL = tn.AV("src");
    URL newURL = resolve(oldURL, sourcePage);

    // Some URL's cannot be resolved, if so, just skip this TagNode.
    // Log the index to the stream (if requested), and continue.
    if (newURL == null)
    { if (askForReturnArraysOrReturnNull) failedUpdate.accept(pos); continue; }

    // If the URL was already a fully-resolved-URL, continue - don't replace the TagNode;
    // No logging needed here, the URL was *already* resolved...
    if (oldURL.length() == newURL.toString().length()) continue;

    // Replace the SRC Attribute in the TagNode. This builds a new instance of TagNode
    // If there is an exception, log the index to the stream (if requested), and continue.
    try
        { tn = tn.setAV("src", newURL.toString(), quote); }
    catch (QuotesException qex)
        { if (askForReturnArraysOrReturnNull) failedUpdate.accept(pos); continue; }

    // Replace the index in the Vector containing the old TagNode with the new one.
    html.setElementAt(tn, pos);

    // The Vector-Index at this position had its old TagNode removed and replaced with a
    // new updated one. Log this to the stream-list so to allow the user to know.
    if (askForReturnArraysOrReturnNull) replaced.accept(pos);
}

return askForReturnArraysOrReturnNull
    ? new Ret3<int[], int[], int[]>
        (hasSrcPosArr, replaced.build().toArray(), failedUpdate.build().toArray())
    : null;
-
resolveAllHREF
public static Ret3<int[],int[],int[]> resolveAllHREF (java.util.Vector<? super TagNode> html, java.net.URL sourcePage, SD quote, boolean askForReturnArraysOrReturnNull)
- Code:
- Exact Method Body:
return resolveAllHREF(html, 0, -1, sourcePage, quote, askForReturnArraysOrReturnNull);
-
resolveAllHREF
public static Ret3<int[],int[],int[]> resolveAllHREF (java.util.Vector<? super TagNode> html, DotPair dp, java.net.URL sourcePage, SD quote, boolean askForReturnArraysOrReturnNull)
- Code:
- Exact Method Body:
return resolveAllHREF (html, dp.start, dp.end + 1, sourcePage, quote, askForReturnArraysOrReturnNull);
-
resolveAllHREF
public static Ret3<int[],int[],int[]> resolveAllHREF (java.util.Vector<? super TagNode> html, int sPos, int ePos, java.net.URL sourcePage, SD quote, boolean askForReturnArraysOrReturnNull)
This method shall resolve all partial URL addresses that are found within TagNode elements having 'HREF=...' attributes. Each instance of TagNode found in the input HTML Vector that has an 'HREF' attribute - if the 'URL' is only partially resolved - shall be updated and replaced with a new TagNode with a fully resolved URL.
HTML's <BASE HREF=...>
When calling methods in this class which accept a complete (or partial) HTML Vector (using a parameter such as Vector<HTMLNode>), the user must take care to check whether the page provided has a definition for the HTML Element <BASE HREF=URL>.
If the input page has such a definition, none of the methods in this class will actually heed it (at all), and therefore the user must manually invoke the method getBaseURL(Vector) in order to retrieve that URL, and then pass that result to input-parameter sourcePage.
More recently, HTML-Pages are making less use of the <BASE> HTML-Tag.
- Parameters:
html - This may be any Vectorized-HTML Web-Page (or sub-page).
The Variable-Type Wild-Card Expression '? super TagNode' means that a Vector<TagNode> and a Vector<HTMLNode> are both accepted by this parameter. They will not cause an exception throw.
Note that if a Vector<Object> is passed, and there are no instances of class TagNode contained by that Vector, then this method will simply exit gracefully.
sPos - This is the (integer) Vector-index that sets a limit for the left-most Vector-position to inspect/search inside the input Vector-parameter. This value is considered 'inclusive', meaning that the HTMLNode at this Vector-index will be visited by this method.
If this value is negative, or larger than the length of the input-Vector, an exception will be thrown.
ePos - This is the (integer) Vector-index that sets a limit for the right-most Vector-position to inspect/search inside the input Vector-parameter. This value is considered 'exclusive', meaning that the HTMLNode at this Vector-index will not be visited by this method.
If this value is larger than the size of the input Vector-parameter, an exception will throw.
Passing a negative value to this parameter, 'ePos', will cause its value to be reset to the size of the input Vector-parameter.
sourcePage - This is the source page URL from which the (possibly-relative) URLs of the TagNodes in the HTML-Vector will be resolved.
quote - A choice for the quotes to use. In most cases, URL attribute values do not contain quotation-marks, so either choice would likely work just fine, without exceptions.
null may be passed to this parameter, and if it is, the original quotation marks found in the TagNode's 'HREF' attribute will be reused. Passing null to this parameter should almost always be the easiest and safest choice.
askForReturnArraysOrReturnNull - This (long-named) parameter is merely here to facilitate retrieving more information from this method - if necessary. When this parameter receives the following values:
- TRUE: Three int[] arrays will be returned, as listed in the Returns: section of this method's documentation.
- FALSE: This method shall return null.
- Returns:
If input parameter 'askForReturnArraysOrReturnNull' has been passed FALSE, this method shall return null. Otherwise (if passed TRUE), this method shall return an instance of 'Ret3<int[], int[], int[]>' - three separate integer-arrays describing what was found, and what has occurred.
Keep in mind that though the information might be superfluous, discarding these arrays is easy. They are provided as a matter of convenience for cases where more detailed information is needed to ensure that long lists of HTMLNodes were properly updated.
- Ret3.a (int[])
The first int[] array shall contain the index of every TagNode in the input-Vector parameter's range that contained a non-null HTML 'HREF' Attribute.
- Ret3.b (int[])
The second int[] array will contain the indices of the TagNodes that were replaced by the internal-resolve logic.
- Ret3.c (int[])
The third int[] array will contain the indices of the TagNodes whose 'HREF=...' attribute failed to be resolved by the internal-resolve logic, or caused a QuotesException to throw.
- Throws:
java.lang.IndexOutOfBoundsException - This exception shall be thrown if any of the following are true:
- If 'sPos' is negative, or if sPos is greater-than-or-equal-to the size of the Vector
- If 'ePos' is zero, or greater than the size of the Vector
- If the value of 'sPos' is a larger integer than 'ePos'. If 'ePos' was negative, it is first reset to Vector.size(), before this check is done.
- See Also:
resolve(String, URL), TagNode.AV(String), TagNode.setAV(String, String, SD)
- Code:
- Exact Method Body:
// Retrieve the Vector-location of any TagNode on the page that has
// a "HREF=..." attribute. These are almost always HTML <A> (Anchor) elements.
// NOTE: FIND Method's are "READ ONLY" - the Cast will make no difference at run-time.
// The @SuppressWarnings is to overcome the cast of 'html'
@SuppressWarnings("unchecked")
int[] hasHRefPosArr = InnerTagFind.all((Vector<HTMLNode>) html, sPos, ePos, "href");

// Java Stream's are convenient for keeping "Growing Lists" of return values.
// This builder shall keep a list of all URL's that failed to update - for any reason
// **UNLESS** the reason is that the URL was already a fully-resolved, non-partial URL
IntStream.Builder failedUpdate = askForReturnArraysOrReturnNull ? IntStream.builder() : null;

// This stream will keep a list of all URL's that were updated, and whose TagNode's
// were replaced inside the input HTML Vector
IntStream.Builder replaced = askForReturnArraysOrReturnNull ? IntStream.builder() : null;

for (int pos : hasHRefPosArr)
{
    // Get the node at the index
    TagNode tn = (TagNode) html.elementAt(pos);

    // 1) Retrieve the HREF Attribute
    // 2) if it is a partial-URL resolve it
    // 3) Convert to a String
    String oldURL = tn.AV("HREF");
    URL newURL = resolve(oldURL, sourcePage);

    // Some URL's cannot be resolved, if so, just skip this TagNode.
    // Log the index to the stream (if requested), and continue.
    if (newURL == null)
    { if (askForReturnArraysOrReturnNull) failedUpdate.accept(pos); continue; }

    // If the URL was already a fully-resolved-URL, continue - don't replace the TagNode;
    // No logging needed here, the URL was *already* resolved...
    if (oldURL.length() == newURL.toString().length()) continue;

    // Replace the HREF Attribute in the TagNode. This builds a new instance of TagNode
    // If there is an exception, log the index to the stream (if requested), and continue.
    try
        { tn = tn.setAV("href", newURL.toString(), quote); }
    catch (QuotesException qex)
        { if (askForReturnArraysOrReturnNull) failedUpdate.accept(pos); continue; }

    // Replace the index in the Vector containing the old TagNode with the new one.
    html.setElementAt(tn, pos);

    // The Vector-Index at this position had its old TagNode removed and replaced with a
    // new updated one. Log this to the stream-list so to allow the user to know.
    if (askForReturnArraysOrReturnNull) replaced.accept(pos);
}

return askForReturnArraysOrReturnNull
    ? new Ret3<int[], int[], int[]>
        (hasHRefPosArr, replaced.build().toArray(), failedUpdate.build().toArray())
    : null;
-
resolveHREFAndUpdate
public static TagNode resolveHREFAndUpdate(TagNode tnWithHREF, java.net.URL sourcePage)
- Code:
- Exact Method Body:
URL url = resolveHREF(tnWithHREF, sourcePage);

return (url == null) ? null : tnWithHREF.setAV("href", url.toString(), null);
-
resolveHREF
public static java.net.URL resolveHREF(TagNode tnWithHREF, java.net.URL sourcePage)
This should be used for TagNodes that contain an 'HREF' inner-tag (attribute).
- Parameters:
tnWithHREF - This may be any HTML Element that contains an 'HREF' attribute. An HTML 'anchor' element (<A HREF=...>) will contain these. Often the URLs found here contain "relative" rather than "absolute" addresses.
sourcePage - This is the source page URL from which the TagNode's (possibly-relative) URL will be resolved.
- Returns:
A complete URL without any missing "presumed data" - such as host/domain or directory. Null is returned if attempting to build the URL generated a MalformedURLException.
SPECIFICALLY: This method shall catch all MalformedURLExceptions.
- Throws:
HREFException - If the TagNode passed to parameter 'tnWithHREF' does not actually contain an HREF attribute, then this exception shall throw.
- See Also:
resolve(String, URL), TagNode.AV(String)
- Code:
- Exact Method Body:
String href = tnWithHREF.AV("href");

if (href == null) throw new HREFException(
    "The TagNode passed to parameter tnWithHREF does not actually contain an " +
    "HREF attribute."
);

return resolve(href, sourcePage);
-
resolveSRCAndUpdate
public static TagNode resolveSRCAndUpdate(TagNode tnWithSRC, java.net.URL sourcePage)
- Code:
- Exact Method Body:
URL url = resolveSRC(tnWithSRC, sourcePage);

return (url == null) ? null : tnWithSRC.setAV("src", url.toString(), null);
-
resolveSRC
public static java.net.URL resolveSRC(TagNode tnWithSRC, java.net.URL sourcePage)
This should be used for TagNodes that contain a 'SRC' inner-tag (attribute).
- Parameters:
tnWithSRC - This may be any HTML Element that contains a 'SRC' attribute. An HTML 'image' element (<IMG SRC=...>) will contain these. Often the URLs found here contain "relative" rather than "absolute" addresses.
sourcePage - This is the source page URL from which the TagNode's (possibly-relative) URL will be resolved.
- Returns:
A complete URL without any missing "presumed data" - such as host/domain or directory. Null is returned if attempting to build the URL generated a MalformedURLException.
SPECIFICALLY: This method shall catch all MalformedURLExceptions.
- Throws:
SRCException - If the TagNode passed to parameter 'tnWithSRC' does not actually contain a SRC attribute, then this exception shall throw.
- See Also:
resolve(String, URL), TagNode.AV(String)
- Code:
- Exact Method Body:
String src = tnWithSRC.AV("src");

if (src == null) throw new SRCException(
    "The TagNode passed to parameter tnWithSRC does not actually contain a " +
    "SRC attribute."
);

return resolve(src, sourcePage);
-
resolveHREFs
public static java.util.Vector<java.net.URL> resolveHREFs (java.lang.Iterable<TagNode> tnListWithHREF, java.net.URL sourcePage)
This should be used for lists of TagNode's, each of which contains an 'HREF' inner-tag (attribute).
- Parameters:
tnListWithHREF - This may be any list of HTML Elements, each of which must be an instance of class TagNode and all of which must have an 'HREF' attribute.
sourcePage - This is the source page URL from which the TagNode's (possibly-relative) URL's in the Iterable will be resolved.
- Returns:
- A list of URL's, each of which has been completed/resolved with the 'sourcePage' parameter. Any TagNode which generated an exception will result in a null value in the Vector. SPECIFICALLY: If any of the elements in tnListWithHREF does not contain an HREF inner-tag, that element also produces a null value in the Vector. The primary reason for returning 'null' rather than throwing an exception is that, when large numbers of links from a web-page are being de-referenced, skipping over "broken URL's" makes for simpler coding.
- See Also:
resolve(String, URL), TagNode.AV(String)
- Code:
- Exact Method Body:
Vector<URL> ret = new Vector<>();

for (TagNode tn : tnListWithHREF) ret.addElement(resolve(tn.AV("href"), sourcePage));

return ret;
-
resolveSRCs
public static java.util.Vector<java.net.URL> resolveSRCs (java.lang.Iterable<TagNode> tnListWithSRC, java.net.URL sourcePage)
This should be used for lists of TagNode's, each of which contains a 'SRC' inner-tag (attribute).
- Parameters:
tnListWithSRC - This may be any list of HTML Elements, each of which must be an instance of class TagNode and all of which must have a 'SRC' attribute.
sourcePage - This is the source page URL from which the TagNode's (possibly-relative) URL's in the Iterable will be resolved.
- Returns:
- A list of URL's, each of which has been completed/resolved with the 'sourcePage' parameter. Any TagNode which generated an exception will result in a null value in the Vector. SPECIFICALLY: If any of the elements in tnListWithSRC does not contain a SRC inner-tag, that element also produces a null value in the Vector. The primary reason for returning 'null' rather than throwing an exception is that, when large numbers of links from a web-page are being de-referenced, skipping over "broken URL's" makes for simpler coding.
- See Also:
resolve(String, URL), TagNode.AV(String)
- Code:
- Exact Method Body:
Vector<URL> ret = new Vector<>();

for (TagNode tn : tnListWithSRC) ret.addElement(resolve(tn.AV("src"), sourcePage));

return ret;
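Since the list-based methods above place null in the returned Vector wherever a link could not be resolved, calling code generally has to check for null entries. The following is only a minimal sketch of such a check, using nothing but the JDK. The class NullAwareSweep and the method brokenPositions are hypothetical names introduced for this illustration, and the Vector is built by hand as a stand-in for a value returned by resolveHREFs or resolveSRCs.

```java
import java.net.URI;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Vector;

class NullAwareSweep
{
    // Returns the positions whose entry is null, meaning that the link at
    // that position could not be resolved.
    static List<Integer> brokenPositions(Vector<URL> resolved)
    {
        List<Integer> broken = new ArrayList<>();

        for (int i = 0; i < resolved.size(); i++)
            if (resolved.elementAt(i) == null) broken.add(i);

        return broken;
    }

    public static void main(String[] args) throws Exception
    {
        // Hand-built stand-in for the Vector returned by a sweep method
        Vector<URL> resolved = new Vector<>();
        resolved.add(URI.create("http://example.com/a.png").toURL());
        resolved.add(null);
        resolved.add(URI.create("http://example.com/b.png").toURL());

        System.out.println(brokenPositions(resolved)); // prints [1]
    }
}
```

Because the positions in the returned Vector line up with the positions of the input elements, the indices reported here can be mapped back to the original TagNode's.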
-
resolveHREFs
public static java.util.Vector<java.net.URL> resolveHREFs (java.util.Vector<? extends HTMLNode> html, int[] nodePosArr, java.net.URL sourcePage)
This will use a "pointer array" - an array containing indexes into the downloaded page to retrieve TagNode's. The TagNode's to which this pointer-array points must each contain an HREF inner-tag with a URL, or a partial URL.
HTML's <BASE HREF=...>
Callers of methods in this class which accept a complete (or partial) HTML Vector (using a parameter such as Vector<HTMLNode>) must take care to check whether the page provided has a definition for the HTML Element <BASE HREF=URL>.
If the input page has such a definition, none of the methods in this class will heed it, and therefore the user must manually invoke the method getBaseURL(Vector) in order to retrieve that URL, and then pass that result to input-parameter sourcePage.
More recently, HTML-Pages are making less use of the <BASE> HTML-Tag.
- Parameters:
html - This may be any Vectorized-HTML Web-Page (or sub-page).
The Variable-Type Wild-Card Expression '? extends HTMLNode' means that a Vector<TagNode>, Vector<TextNode> or Vector<CommentNode> will all be accepted by this parameter at compile-time.
These 'sub-type' Vectors are often returned as search results from the classes in the 'NodeSearch' package.
nodePosArr - An array of pointers into the page or sub-page. The pointers must reference TagNode's that contain HREF attributes. Integer-pointer Arrays are usually returned from the package 'NodeSearch' "Find" methods.
Example:
// Retrieve 'pointers' to all the '<A HREF=...>' TagNode's. The term 'pointer' refers to
// integer-indices into the vectorized-html variable 'page'
int[] anchorPosArr = TagNodeFind.all(page, TC.OpeningTags, "a");

// Extract each HREF inner-tag, and construct a URL. Use the 'sourcePage' parameter
// if the URL is only partially-resolved
Vector<URL> urls = Links.resolveHREFs(page, anchorPosArr, mySourcePage);
This would obtain a pointer-array (a.k.a. a "vector-index-array") to every HTML "<A ...>" element that was available in the HTML page-Vector parameter 'html', and then resolve any shortened URL's.
sourcePage - This is the source page URL from which the (possibly relative) TagNode URL's in the Vector are to be resolved.
- Returns:
- A list of URL's, each of which has been completed/resolved with the 'sourcePage' parameter. Any TagNode whose URL generated a MalformedURLException will result in a null value in the Vector. However, if any of the positions in the 'nodePosArr' parameter does not hold an opening TagNode element, then this mistake shall generate a TagNodeExpectedException or an OpeningTagNodeExpectedException. SPECIFICALLY: If any of the TagNode's referenced by nodePosArr does not contain an HREF inner-tag, that element also produces a null value in the Vector. The primary reason for returning 'null' rather than throwing an exception is that, when large numbers of links from a web-page are being de-referenced, skipping over "broken URL's" makes for simpler coding.
- Throws:
java.lang.ArrayIndexOutOfBoundsException - If any of the elements in 'nodePosArr' contain index-pointers that are out of range of Vector-parameter 'html', then java will, naturally, throw this exception.
OpeningTagNodeExpectedException - When a Vector position-index holds an instance of TagNode, but that TagNode is one in which its isClosing-Field is set to TRUE, then this exception shall throw.
The int[]-Array parameter 'nodePosArr' should contain a list of Vector-indices. The code checks that each of the locations in that array points to an Opening TagNode, and if or when one does not, this exception throws.
TagNodeExpectedException - This exception shall throw if an identified Vector-index must point to an instance of TagNode, but that index instead holds some other HTMLNode instance (either CommentNode or TextNode). If the integer-position array (int[] nodePosArr) has an index pointing to something besides a TagNode, then this exception will be thrown.
- See Also:
resolve(String, URL), TagNode.AV(String)
- Code:
- Exact Method Body:
// Return Vector
Vector<URL> ret = new Vector<>();

for (int nodePos : nodePosArr)
{
    HTMLNode n = html.elementAt(nodePos);

    // Must be an HTML TagNode
    if (! n.isTagNode()) throw new TagNodeExpectedException(nodePos);

    TagNode tn = (TagNode) n;

    // Must be an "Opening" HTML TagNode
    if (tn.isClosing) throw new OpeningTagNodeExpectedException(nodePos);

    // Resolve the 'HREF', save the URL
    ret.addElement(resolve(tn.AV("href"), sourcePage));
}

return ret;
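The <BASE HREF=...> note for this method says that the caller is responsible for choosing the right 'sourcePage'. The sketch below illustrates that decision with JDK classes only. It is not this library's getBaseURL(Vector) method; the class EffectiveBase, the method effectiveBase and the example.com addresses are assumptions made for the illustration, and the BASE HREF value is assumed to have been extracted from the page already.

```java
import java.net.URI;

class EffectiveBase
{
    // Picks the URL that relative links should be resolved against.
    // 'baseHref' is the value of the page's <BASE HREF=...> attribute, or
    // null when the page has no such element.
    static URI effectiveBase(URI pageUrl, String baseHref)
    {
        if (baseHref == null || baseHref.trim().isEmpty()) return pageUrl;

        // The BASE HREF value may itself be relative to the page URL
        return pageUrl.resolve(baseHref.trim());
    }

    public static void main(String[] args)
    {
        URI page = URI.create("http://example.com/news/today/index.html");

        // No <BASE> element: the page URL itself is the source page
        System.out.println(effectiveBase(page, null));
        // prints http://example.com/news/today/index.html

        // With <BASE HREF="/static/">, links resolve against that directory
        URI base = effectiveBase(page, "/static/");
        System.out.println(base.resolve("logo.png"));
        // prints http://example.com/static/logo.png
    }
}
```

With this library, the value chosen this way would be converted to a java.net.URL and passed as the 'sourcePage' argument.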
-
resolveSRCs
public static java.util.Vector<java.net.URL> resolveSRCs (java.util.Vector<? extends HTMLNode> html, int[] nodePosArr, java.net.URL sourcePage)
This will use a "pointer array" - an array containing indexes into the downloaded page to retrieve TagNode's. The TagNode's to which this pointer-array points must each contain a SRC inner-tag with a URL, or a partial URL.
HTML's <BASE HREF=...>
Callers of methods in this class which accept a complete (or partial) HTML Vector (using a parameter such as Vector<HTMLNode>) must take care to check whether the page provided has a definition for the HTML Element <BASE HREF=URL>.
If the input page has such a definition, none of the methods in this class will heed it, and therefore the user must manually invoke the method getBaseURL(Vector) in order to retrieve that URL, and then pass that result to input-parameter sourcePage.
More recently, HTML-Pages are making less use of the <BASE> HTML-Tag.
- Parameters:
html - This may be any Vectorized-HTML Web-Page (or sub-page).
The Variable-Type Wild-Card Expression '? extends HTMLNode' means that a Vector<TagNode>, Vector<TextNode> or Vector<CommentNode> will all be accepted by this parameter at compile-time.
These 'sub-type' Vectors are often returned as search results from the classes in the 'NodeSearch' package.
nodePosArr - An array of pointers into the page or sub-page. The pointers must reference TagNode's that contain SRC attributes. Integer-pointer Arrays are usually returned from the package 'NodeSearch' "Find" methods.
Example:
// Retrieve 'pointers' to all the '<IMG SRC=...>' TagNode's. The term 'pointer' refers to
// integer-indices into the vectorized-html variable 'page'
int[] picturePosArr = TagNodeFind.all(page, TC.OpeningTags, "img");

// Extract each SRC inner-tag, and construct a URL. Use the 'sourcePage' parameter
// if the URL is only partially-resolved
Vector<URL> urls = Links.resolveSRCs(page, picturePosArr, mySourcePage);
This would obtain a pointer-array (a.k.a. a "vector-index-array") to every HTML "<IMG ...>" element that was available in the HTML page-Vector parameter 'html', and then resolve any shortened image URL's.
sourcePage - This is the source page URL from which the (possibly relative) TagNode URL's in the Vector are to be resolved.
- Returns:
- A list of URL's, each of which has been completed/resolved with the 'sourcePage' parameter. Any TagNode whose URL generated a MalformedURLException will result in a null value in the Vector. However, if any of the positions in the 'nodePosArr' parameter does not hold an opening TagNode element, then this mistake shall generate a TagNodeExpectedException or an OpeningTagNodeExpectedException. SPECIFICALLY: If any of the TagNode's referenced by nodePosArr does not contain a SRC inner-tag, that element also produces a null value in the Vector. The primary reason for returning 'null' rather than throwing an exception is that, when large numbers of links from a web-page are being de-referenced, skipping over "broken URL's" makes for simpler coding.
- Throws:
java.lang.ArrayIndexOutOfBoundsException - If any of the elements in 'nodePosArr' contain index-pointers that are out of range of Vector-parameter 'html', then java will, naturally, throw this exception.
OpeningTagNodeExpectedException - When a Vector position-index holds an instance of TagNode, but that TagNode is one in which its isClosing-Field is set to TRUE, then this exception shall throw.
The int[]-Array parameter 'nodePosArr' should contain a list of Vector-indices. The code checks that each of the locations in that array points to an Opening TagNode, and if or when one does not, this exception throws.
TagNodeExpectedException - This exception shall throw if an identified Vector-index must point to an instance of TagNode, but that index instead holds some other HTMLNode instance (either CommentNode or TextNode). If the integer-position array (int[] nodePosArr) has an index pointing to something besides a TagNode, then this exception will be thrown.
- See Also:
resolve(String, URL), TagNode.AV(String)
- Code:
- Exact Method Body:
// Return Vector
Vector<URL> ret = new Vector<>();

for (int nodePos : nodePosArr)
{
    HTMLNode n = html.elementAt(nodePos);

    // Must be an HTML TagNode
    if (! n.isTagNode()) throw new TagNodeExpectedException(nodePos);

    TagNode tn = (TagNode) n;

    // Must be an "Opening" HTML TagNode
    if (tn.isClosing) throw new OpeningTagNodeExpectedException(nodePos);

    // Resolve the "SRC", save the URL
    ret.addElement(resolve(tn.AV("src"), sourcePage));
}

return ret;
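The order of the checks in a pointer-array sweep (index in range, node is a tag, tag is an opening tag, attribute may be absent) can be tried out without the library. The sketch below is only an illustration of that validation order. The nested classes Node, Text and Tag are hypothetical stand-ins for HTMLNode, TextNode and TagNode, and IllegalArgumentException stands in for TagNodeExpectedException and OpeningTagNodeExpectedException.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PointerArraySketch
{
    // Hypothetical stand-ins for the library's node types
    static class Node { }
    static class Text extends Node { }
    static class Tag extends Node
    {
        final boolean isClosing;
        final String src;   // null when the tag has no SRC attribute

        Tag(boolean isClosing, String src)
        { this.isClosing = isClosing; this.src = src; }
    }

    // Collects the raw SRC value at each position, validating each node first
    static List<String> srcValues(List<? extends Node> page, int[] nodePosArr)
    {
        List<String> ret = new ArrayList<>();

        for (int pos : nodePosArr)
        {
            // Throws IndexOutOfBoundsException when 'pos' is out of range
            Node n = page.get(pos);

            if (!(n instanceof Tag))
                throw new IllegalArgumentException("Not a tag at position " + pos);

            Tag t = (Tag) n;

            if (t.isClosing)
                throw new IllegalArgumentException("Closing tag at position " + pos);

            // A missing attribute yields a null entry rather than an exception
            ret.add(t.src);
        }

        return ret;
    }

    public static void main(String[] args)
    {
        List<Node> page = Arrays.asList(
            new Tag(false, "a.png"), new Text(), new Tag(false, null), new Tag(true, null)
        );

        System.out.println(srcValues(page, new int[] { 0, 2 })); // prints [a.png, null]
    }
}
```

Structural mistakes in the pointer array are reported immediately, while a missing attribute is treated like a broken link and only produces a null entry.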
-
resolve
public static java.util.Vector<java.net.URL> resolve (java.util.Vector<java.lang.String> src, java.net.URL sourcePage)
This will convert a list of simple java String's to a list/Vector of URL's, de-referencing any missing information using the 'sourcePage' parameter.
- Parameters:
src - A list of strings, usually partially or totally completed Internet URL's.
sourcePage - This is the source page URL from which the String's (possibly-relative) URL's in the Vector will be resolved.
- Returns:
- A list of URL's, each of which has been completed/resolved with the 'sourcePage' parameter. If there were any String's that were zero-length or null, then null is returned in the related Vector position. If any String causes a MalformedURLException, then that position in the Vector will be null.
- See Also:
resolve(String, URL)
- Code:
- Exact Method Body:
Vector<URL> ret = new Vector<>();

for (String s : src) ret.addElement(resolve(s, sourcePage));

return ret;
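The "null in place of a failed URL" contract described for this method can be sketched with JDK classes alone. This is not the library's implementation: the class SweepWithNulls and its methods are hypothetical, and java.net.URI.resolve is used as a stand-in resolver, so edge cases may be handled differently than by resolve(String, URL).

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.Arrays;
import java.util.List;
import java.util.Vector;

class SweepWithNulls
{
    // Resolves every string; failures become null so that positions in the
    // result line up with positions in the input.
    static Vector<URL> resolveAll(List<String> links, URI sourcePage)
    {
        Vector<URL> ret = new Vector<>();

        for (String s : links) ret.addElement(resolveOrNull(s, sourcePage));

        return ret;
    }

    static URL resolveOrNull(String s, URI sourcePage)
    {
        if (s == null || s.trim().isEmpty()) return null;

        try
            { return sourcePage.resolve(s.trim()).toURL(); }
        catch (IllegalArgumentException | MalformedURLException e)
            { return null; }   // Suppressed: the caller sees null instead
    }

    public static void main(String[] args)
    {
        URI page = URI.create("http://example.com/dir/page.html");

        // The last entry contains a space, which java.net.URI rejects
        List<String> links = Arrays.asList("pic.png", null, "", "bad link.png");

        System.out.println(resolveAll(links, page));
        // prints [http://example.com/dir/pic.png, null, null, null]
    }
}
```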
-
resolve
public static java.net.URL resolve(java.lang.String src, java.net.URL sourcePage)
This will convert a simple java String to a URL, de-referencing any missing information using the 'sourcePage' parameter.
- Parameters:
src - Any java String, usually one which was scraped from an HTML-Page, and needs to be "completed."
sourcePage - This is the source page URL from which the String's (possibly-relative) URL will be resolved.
- Returns:
- A URL, which has been completed/resolved with the 'sourcePage' parameter. If parameter 'src' is null or zero-length, then this method will also return null. If a MalformedURLException is generated, null will also be returned.
- Code:
- Exact Method Body:
if (sourcePage == null) throw new NullPointerException(
    "Though you may provide null to the partial-URL to dereference parameter, null " +
    "may not be passed to the Source-Page Parameter. The purpose of the 'resolve' " +
    "operation is to resolve partial-URLs against a source-page (root) URL. " +
    "Therefore this is not allowed."
);

if (src == null) return null;

src = src.trim();

if (src.length() == 0) return null;

String srcLC = src.toLowerCase();

if (StrCmpr.startsWithXOR(srcLC, _NON_URL_HREFS)) return null;

if (srcLC.startsWith("http://") || srcLC.startsWith("https://"))
    try
        { return new URL(src); }
    catch (MalformedURLException e)
        { return null; }

if (src.startsWith("//") && (src.charAt(3) != '/'))
    try
        { return new URL(sourcePage.getProtocol().toLowerCase() + ":" + src); }
    catch (MalformedURLException e)
        { return null; }

if (src.startsWith("/"))
    try
    {
        return new URL(
            sourcePage.getProtocol().toLowerCase() + "://" +
            sourcePage.getHost().toLowerCase() + src
        );
    }
    catch (MalformedURLException e)
        { return null; }

if (src.startsWith("../"))
{
    String sourcePageStr = sourcePage.toString();
    short nLevels = 0;

    do
        { nLevels++; src = src.substring(3); }
    while (src.startsWith("../"));

    String directory = StringParse.dotDotParentDirectory(sourcePage.toString(), nLevels);

    try
        { return new URL(directory + src); }
    catch (Exception e)
        { return null; }
}

String root =
    sourcePage.getProtocol().toLowerCase() + "://" +
    sourcePage.getHost().toLowerCase();

String path = sourcePage.getPath().trim();
int pos = StringParse.findLastFrontSlashPos(path);

if (pos == -1) throw new StringIndexOutOfBoundsException(
    "The URL you have provided: " + sourcePage.toString() + " does not have a '/' " +
    "front-slash character in it's path. Cannot proceed resolving relative-URL's " +
    "without this."
);

path = path.substring(0, pos + 1);

try
    { return new URL(root + path + src); }
catch (MalformedURLException e)
    { return null; }
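The method body above distinguishes five forms of link: absolute, protocol-relative ("//"), root-relative ("/"), parent-relative ("../") and plain relative. For comparison, the sketch below pushes one example of each form through the JDK's own resolver, java.net.URI.resolve. It is not this library's code and may differ from it in edge cases; the class RelativeForms and the example addresses are assumptions made for the illustration.

```java
import java.net.URI;

class RelativeForms
{
    // Resolves 'link' against 'sourcePage' using the JDK's URI resolver
    static String resolve(String sourcePage, String link)
    { return URI.create(sourcePage).resolve(link).toString(); }

    public static void main(String[] args)
    {
        String page = "http://example.com/a/b/page.html";

        // Absolute: returned unchanged
        System.out.println(resolve(page, "https://other.org/z.html"));
        // prints https://other.org/z.html

        // Protocol-relative: takes only the protocol from the source page
        System.out.println(resolve(page, "//cdn.example.org/x.js"));
        // prints http://cdn.example.org/x.js

        // Root-relative: takes protocol and host from the source page
        System.out.println(resolve(page, "/top.html"));
        // prints http://example.com/top.html

        // Parent-relative: climbs one directory for each "../"
        System.out.println(resolve(page, "../up.html"));
        // prints http://example.com/a/up.html

        // Plain relative: replaces the file name in the source page's path
        System.out.println(resolve(page, "pic.png"));
        // prints http://example.com/a/b/pic.png
    }
}
```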
-
resolveHREF_KE
public static Ret2<java.net.URL,java.net.MalformedURLException> resolveHREF_KE (TagNode tnWithHREF, java.net.URL sourcePage)
This should be used for TagNode's that contain an 'HREF' inner-tag (attribute).
KE: - Keep Exceptions
If this method generates a 'MalformedURLException' it will be returned along with the result (not thrown).
Within the Pair-Tuple, Ret2<URL, MalformedURLException>, precisely one of the two references will be non-null. If the URL was properly resolved, then the URL field (field Ret2.a) will be non-null. Otherwise, the MalformedURLException field (field Ret2.b) will be non-null.
- Parameters:
tnWithHREF - This may be any HTML Element that contains an 'HREF' attribute. An HTML 'anchor' element (<A HREF=...>) will contain these. Often the URL's found here contain "relative" rather than "absolute" addresses.
sourcePage - This is the source page URL from which the TagNode's (possibly-relative) URL will be resolved.
- Returns:
- A Ret2 holding either a complete URL without any missing "presumed data" (such as host/domain or directory), or, if the TagNode causes a MalformedURLException, that exception in Ret2.b. SPECIFICALLY: This method shall catch all MalformedURLException's.
Ret2.a (URL)
This shall contain the fully resolved URL - resolved using the parameter 'sourcePage' as the Base-URL.
Ret2.b (MalformedURLException)
If there were any problems resolving the URL - such that an exception was thrown while producing the resolved-URL, the exception thrown will be caught and returned as a reference instead.
- Throws:
HREFException - If the TagNode passed to parameter 'tnWithHREF' does not actually contain an HREF attribute, then this exception shall throw.
- See Also:
resolve_KE(String, URL), TagNode.AV(String), Ret2
- Code:
- Exact Method Body:
String href = tnWithHREF.AV("href");

if (href == null) throw new HREFException(
    "The TagNode passed to parameter tnWithHREF does not actually contain an " +
    "HREF attribute."
);

return LinksResolve_KE.resolve(href, sourcePage);
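The "Keep Exceptions" idea, where a result carries either the URL or the exception but never both, can be sketched with the JDK alone. None of this is the library's code: the nested class Result is a hypothetical stand-in for Ret2<URL, MalformedURLException>, resolveKE is a hypothetical method that assumes a non-null link, and java.net.URI is used as the resolver.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

class KeepExceptionSketch
{
    // Hypothetical stand-in for Ret2<URL, MalformedURLException>
    static final class Result
    {
        final URL a;                    // non-null on success
        final MalformedURLException b;  // non-null on failure

        Result(URL a, MalformedURLException b) { this.a = a; this.b = b; }
    }

    // Returns the exception inside the result instead of throwing it
    static Result resolveKE(String link, URI sourcePage)
    {
        try
            { return new Result(sourcePage.resolve(new URI(link)).toURL(), null); }
        catch (MalformedURLException e)
            { return new Result(null, e); }
        catch (URISyntaxException | IllegalArgumentException e)
        {
            // Syntax problems are reported through the same exception type
            return new Result(null, new MalformedURLException(e.getMessage()));
        }
    }

    public static void main(String[] args)
    {
        URI page = URI.create("http://example.com/dir/page.html");

        Result ok = resolveKE("pic.png", page);
        System.out.println(ok.a); // prints http://example.com/dir/pic.png

        // "foo" is not a protocol that java.net.URL knows how to handle
        Result bad = resolveKE("foo://host/x", page);
        System.out.println(bad.a == null && bad.b != null); // prints true
    }
}
```

Keeping the exception lets a caller report why a particular link failed, which the null-returning variants cannot do.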
-
resolveSRC_KE
public static Ret2<java.net.URL,java.net.MalformedURLException> resolveSRC_KE (TagNode tnWithSRC, java.net.URL sourcePage)
This should be used for TagNode's that contain a 'SRC' inner-tag (attribute).
KE: - Keep Exceptions
If this method generates a 'MalformedURLException' it will be returned along with the result (not thrown).
Within the Pair-Tuple, Ret2<URL, MalformedURLException>, precisely one of the two references will be non-null. If the URL was properly resolved, then the URL field (field Ret2.a) will be non-null. Otherwise, the MalformedURLException field (field Ret2.b) will be non-null.
- Parameters:
tnWithSRC - This may be any HTML Element that contains a 'SRC' attribute. An HTML 'image' element (<IMG SRC=...>) will contain these. Often the URL's found here contain "relative" rather than "absolute" addresses.
sourcePage - This is the source page URL from which the TagNode's (possibly-relative) URL will be resolved.
- Returns:
- A Ret2 holding either a complete URL without any missing "presumed data" (such as host/domain or directory), or, if the TagNode causes a MalformedURLException, that exception in Ret2.b. SPECIFICALLY: This method shall catch all MalformedURLException's.
Ret2.a (URL)
This shall contain the fully resolved URL - resolved using the parameter 'sourcePage' as the Base-URL.
Ret2.b (MalformedURLException)
If there were any problems resolving the URL - such that an exception was thrown while producing the resolved-URL, the exception thrown will be caught and returned as a reference instead.
- Throws:
SRCException - If the TagNode passed to parameter 'tnWithSRC' does not actually contain a SRC attribute, then this exception shall throw.
- See Also:
resolve_KE(String, URL), TagNode.AV(String), Ret2
- Code:
- Exact Method Body:
String src = tnWithSRC.AV("src");

if (src == null) throw new SRCException(
    "The TagNode passed to parameter tnWithSRC does not actually contain a " +
    "SRC attribute."
);

return LinksResolve_KE.resolve(src, sourcePage);
-
resolveHREFs_KE
public static java.util.Vector<Ret2<java.net.URL,java.net.MalformedURLException>> resolveHREFs_KE (java.lang.Iterable<TagNode> tnListWithHREF, java.net.URL sourcePage)
This should be used for lists of TagNode's, each of which contains an 'HREF' inner-tag (attribute).
KE: - Keep Exceptions
If this method generates a 'MalformedURLException' it will be returned along with the result (not thrown).
Within the Pair-Tuple, Ret2<URL, MalformedURLException>, precisely one of the two references will be non-null. If the URL was properly resolved, then the URL field (field Ret2.a) will be non-null. Otherwise, the MalformedURLException field (field Ret2.b) will be non-null.
- Parameters:
tnListWithHREF - This may be any list of HTML Elements, each of which must be an instance of class TagNode and all of which must have an 'HREF' attribute.
sourcePage - This is the source page URL from which the TagNode's (possibly-relative) URL's in the Iterable will be resolved.
- Returns:
- A list of Ret2 results, each holding a URL which has been completed/resolved with the 'sourcePage' parameter. If any TagNode in tnListWithHREF has no HREF inner-tag, then null is returned in the related Vector position. If any TagNode causes a MalformedURLException, then that position in the Vector will contain the exception in Ret2.b. The primary reason for returning 'null' rather than throwing an exception is that, when large numbers of links from a web-page are being de-referenced, skipping over "broken URL's" makes for simpler coding.
Ret2.a (URL)
This shall contain the fully resolved URL - resolved using the parameter 'sourcePage' as the Base-URL.
Ret2.b (MalformedURLException)
If there were any problems resolving the URL - such that an exception was thrown while producing the resolved-URL, the exception thrown will be caught and returned as a reference instead.
- See Also:
resolve_KE(String, URL), TagNode.AV(String), Ret2
- Code:
- Exact Method Body:
Vector<Ret2<URL, MalformedURLException>> ret = new Vector<>();

for (TagNode tn : tnListWithHREF)
    ret.addElement(LinksResolve_KE.resolve(tn.AV("href"), sourcePage));

return ret;
-
resolveSRCs_KE
public static java.util.Vector<Ret2<java.net.URL,java.net.MalformedURLException>> resolveSRCs_KE (java.lang.Iterable<TagNode> tnListWithSRC, java.net.URL sourcePage)
This should be used for lists of TagNode's, each of which contains a 'SRC' inner-tag (attribute).
KE: - Keep Exceptions
If this method generates a 'MalformedURLException' it will be returned along with the result (not thrown).
Within the Pair-Tuple, Ret2<URL, MalformedURLException>, precisely one of the two references will be non-null. If the URL was properly resolved, then the URL field (field Ret2.a) will be non-null. Otherwise, the MalformedURLException field (field Ret2.b) will be non-null.
- Parameters:
tnListWithSRC - This may be any list of HTML Elements, each of which must be an instance of class TagNode and all of which must have a 'SRC' attribute.
sourcePage - This is the source page URL from which the TagNode's (possibly-relative) URL's in the Iterable will be resolved.
- Returns:
- A list of Ret2 results, each holding a URL which has been completed/resolved with the 'sourcePage' parameter. If any TagNode in tnListWithSRC has no SRC inner-tag, then null is returned in the related Vector position. If any TagNode causes a MalformedURLException, then that position in the Vector will contain the exception in Ret2.b. The primary reason for returning 'null' rather than throwing an exception is that, when large numbers of links from a web-page are being de-referenced, skipping over "broken URL's" makes for simpler coding.
Ret2.a (URL)
This shall contain the fully resolved URL - resolved using the parameter 'sourcePage' as the Base-URL.
Ret2.b (MalformedURLException)
If there were any problems resolving the URL - such that an exception was thrown while producing the resolved-URL, the exception thrown will be caught and returned as a reference instead.
- See Also:
resolve_KE(String, URL), TagNode.AV(String), Ret2
- Code:
- Exact Method Body:
Vector<Ret2<URL, MalformedURLException>> ret = new Vector<>();

for (TagNode tn : tnListWithSRC)
    ret.addElement(LinksResolve_KE.resolve(tn.AV("src"), sourcePage));

return ret;
-
resolveHREFs_KE
public static java.util.Vector<Ret2<java.net.URL,java.net.MalformedURLException>> resolveHREFs_KE (java.util.Vector<? extends HTMLNode> html, int[] nodePosArr, java.net.URL sourcePage)
This will use a "pointer array" - an array containing indexes into the downloaded page to retrieve TagNode's. The TagNode's to which this pointer-array points must each contain an HREF inner-tag with a URL, or a partial URL.
HTML's <BASE HREF=...>
Callers of methods in this class which accept a complete (or partial) HTML Vector (using a parameter such as Vector<HTMLNode>) must take care to check whether the page provided has a definition for the HTML Element <BASE HREF=URL>.
If the input page has such a definition, none of the methods in this class will heed it, and therefore the user must manually invoke the method getBaseURL(Vector) in order to retrieve that URL, and then pass that result to input-parameter sourcePage.
More recently, HTML-Pages are making less use of the <BASE> HTML-Tag.
KE: - Keep Exceptions
If this method generates a 'MalformedURLException' it will be returned along with the result (not thrown).
Within the Pair-Tuple, Ret2<URL, MalformedURLException>, precisely one of the two references will be non-null. If the URL was properly resolved, then the URL field (field Ret2.a) will be non-null. Otherwise, the MalformedURLException field (field Ret2.b) will be non-null.
- Parameters:
html - This may be any Vectorized-HTML Web-Page (or sub-page).
The Variable-Type Wild-Card Expression '? extends HTMLNode' means that a Vector<TagNode>, Vector<TextNode> or Vector<CommentNode> will all be accepted by this parameter at compile-time.
These 'sub-type' Vectors are often returned as search results from the classes in the 'NodeSearch' package.
nodePosArr - An array of pointers into the page or sub-page. The pointers must reference TagNode's that contain HREF attributes. Integer-pointer Arrays are usually returned from the package 'NodeSearch' "Find" methods.
Example:
// Retrieve 'pointers' to all the '<A HREF=...>' TagNode's. The term 'pointer' refers to
// integer-indices into the vectorized-html variable 'page'
int[] anchorPosArr = TagNodeFind.all(page, TC.OpeningTags, "a");

// Extract each HREF inner-tag, and construct a URL. Use the 'sourcePage' parameter if
// the URL is only partially-resolved. If any URL's on the original-page are invalid, the
// method shall not crash, but save the exception instead.
Vector<Ret2<URL, MalformedURLException>> urlsWithEx =
    Links.resolveHREFs_KE(page, anchorPosArr, mySourcePage);

// Print out any "failed" urls
for (Ret2<URL, MalformedURLException> r : urlsWithEx)
    if (r.b != null)
        System.out.println("There was an exception: " + r.b.toString());
This would obtain a pointer-array (a.k.a. a "vector-index-array") to every HTML "<A ...>" element that was available in the HTML page-Vector parameter 'html', and then resolve any shortened URL's.
sourcePage - This is the source page URL from which the TagNode's (possibly-relative) URL's in the Vector will be resolved.
- Returns:
- A list of Ret2 results, each holding a URL which has been completed/resolved with the 'sourcePage' parameter. If any TagNode referenced by nodePosArr has no HREF inner-tag, then null is returned in the related Vector position. If any TagNode causes a MalformedURLException, then that position in the Vector will contain the exception in Ret2.b. The primary reason for returning 'null' rather than throwing an exception is that, when large numbers of links from a web-page are being de-referenced, skipping over "broken URL's" makes for simpler coding.
Ret2.a (URL)
This shall contain the fully resolved URL - resolved using the parameter 'sourcePage' as the Base-URL.
Ret2.b (MalformedURLException)
If there were any problems resolving the URL - such that an exception was thrown while producing the resolved-URL, the exception thrown will be caught and returned as a reference instead.
- Throws:
java.lang.ArrayIndexOutOfBoundsException - If any of the elements in 'nodePosArr' contain index-pointers that are out of range of Vector-parameter 'html', then java will, naturally, throw this exception.
OpeningTagNodeExpectedException - When a Vector position-index holds an instance of TagNode, but that TagNode is one in which its isClosing-Field is set to TRUE, then this exception shall throw.
The int[]-Array parameter 'nodePosArr' should contain a list of Vector-indices. The code checks that each of the locations in that array points to an Opening TagNode, and if or when one does not, this exception throws.
TagNodeExpectedException - This exception shall throw if an identified Vector-index must point to an instance of TagNode, but that index instead holds some other HTMLNode instance (either CommentNode or TextNode). If the integer-position array (int[] nodePosArr) has an index pointing to something besides a TagNode, then this exception will be thrown.
- See Also:
resolve_KE(String, URL), TagNode.AV(String), Ret2
- Code:
- Exact Method Body:
// Return Vector
Vector<Ret2<URL, MalformedURLException>> ret = new Vector<>();

for (int nodePos : nodePosArr)
{
    HTMLNode n = html.elementAt(nodePos);

    // Must be an HTML TagNode
    if (! n.isTagNode()) throw new TagNodeExpectedException(nodePos);

    TagNode tn = (TagNode) n;

    // Must be an "Opening" HTML TagNode
    if (tn.isClosing) throw new OpeningTagNodeExpectedException(nodePos);

    // Resolve the "HREF", keep the URL
    ret.addElement(LinksResolve_KE.resolve(tn.AV("href"), sourcePage));
}

return ret;
-
resolveSRCs_KE
public static java.util.Vector<Ret2<java.net.URL,java.net.MalformedURLException>> resolveSRCs_KE (java.util.Vector<? extends HTMLNode> html, int[] nodePosArr, java.net.URL sourcePage)
This will use a "pointer array" - an array containing indexes into the downloaded page to retrieveTagNode's. TheTagNodeto which this pointer-array points - must containSRCinner-tags withURL's, or partialURL's.
HTML's<BASE HREF=...>
Methods in this class which accept a complete (or partial) HTMLVector(using a parameter such asVector<HTMLNode>) must take care to check if the page provided has a definition for HTML Element<BASE HREF=URL>.
If the input page has such a definition, none of the methods in this class will actually heed it (at all), and therefore the user must manually invoke the method getBaseURL(Vector) in order to retrieve thatURL, and then pass that result to input-parametersourcePage.
More recently, HTML-Pages are making less use of<BASE>HTML-Tag.
KE: - Keep Exceptions
If this method generates a'MalformedURLException'it will be returned along with the result (not thrown).
Within the Pair-Tuple,Ret2<URL, MalformedURLException>:, precisely one of the two references will be non-null. If theURLwas properly resolved, then theURLfield (fieldRet2.a) will be non-null. Otherwise, theMalformedURLExceptionfield (fieldRet2.b) will be non-null.- Parameters:
html- This may be any Vectorized-HTML Web-Page (or sub-page).
The Variable-Type Wild-Card Expression'? extends HTMLNode'means that aVector<TagNode>, Vector<TextNode>orVector<CommentNode>will all be accepted by this paramter without causing an exception throw.
These 'sub-type' Vectors are often returned as search results from the classes in the'NodeSearch'vpackage. Any HTML page (or sub-page)nodePosArr- An array of pointers into the page or sub-page. The pointers must referenceTagNode'sthat containSRCattributes. Integer-pointer Arrays are usually return from thepackage 'NodeSearch'"Find" methods.
Example:
// Retrieve 'pointers' to all the '<IMG SRC=...>' TagNode's. The term 'pointer' refers to
// integer-indices into the vectorized-html variable 'page'
int[] picturePosArr = TagNodeFind.all(page, TC.OpeningTags, "img");

// Extract each SRC inner-tag, and construct a URL. Use the 'sourcePage' parameter if
// the URL is only partially-resolved. If any URL's on the original-page are invalid,
// the method shall not crash, but save the exception instead.
Vector<Ret2<URL, MalformedURLException>> urlsWithEx =
    Links.resolveSRCs_KE(page, picturePosArr, mySourcePage);

// Print out any "failed" urls
for (Ret2<URL, MalformedURLException> r : urlsWithEx)
    if (r.b != null)
        System.out.println("There was an exception: " + r.b.toString());
This example obtains a pointer-array (a.k.a. a "vector-index-array") to every HTML "<IMG ...>" element available in the HTML page-Vector, and then resolves any shortened URL's.
sourcePage - This is the source page URL from which the TagNode's (possibly-relative) URL's in the Vector will be resolved.
- Returns:
- A list of URL's, each of which has been completed/resolved with the 'sourcePage' parameter. If any TagNode has no SRC tag, then null is returned in the related Vector position. If any TagNode causes a MalformedURLException, then that position in the Vector will contain the exception in Ret2.b.
SPECIFICALLY: If any of the TagNode's referenced by 'nodePosArr' does not contain a SRC inner-tag, the related Vector position will hold null. The primary reason for returning 'null' rather than throwing an exception is that, when large numbers of links from a web-page are being de-referenced, skipping over "broken URL's" makes for simpler coding.
Ret2.a (URL)
This shall contain the fully resolved URL, resolved using the parameter 'sourcePage' as the Base-URL.
Ret2.b (MalformedURLException)
If there were any problems resolving the URL, such that an exception was thrown while producing the resolved URL, the exception will be caught and returned as a reference instead.
- Throws:
java.lang.ArrayIndexOutOfBoundsException - If any element of 'nodePosArr' is an index-pointer that is out of range of Vector-parameter 'html', then Java will, naturally, throw this exception.
OpeningTagNodeExpectedException - When a Vector position-index holds an instance of TagNode, but that TagNode's isClosing field is set to TRUE, this exception shall throw.
When passing int[]-Array parameter 'nodePosArr', that array should contain a list of Vector-indices. The code checks that each location in that array points to an Opening TagNode, and if one does not, this exception throws.
TagNodeExpectedException - This exception shall throw if an identified Vector-index must point to an instance of TagNode, but that index instead holds some other HTMLNode instance (either CommentNode or TextNode). If an integer-position array (int[] nodePosArr) is passed, but that array has an index pointing to something besides a TagNode, then this exception will be thrown.
- See Also:
resolve_KE(String, URL), TagNode.AV(String), Ret2
- Code:
- Exact Method Body:
// Return Vector
Vector<Ret2<URL, MalformedURLException>> ret = new Vector<>();

for (int nodePos : nodePosArr)
{
    HTMLNode n = html.elementAt(nodePos);

    // Must be an HTML TagNode
    if (! n.isTagNode()) throw new TagNodeExpectedException(nodePos);
    TagNode tn = (TagNode) n;

    // Must be an "Opening" HTML TagNode
    if (tn.isClosing) throw new OpeningTagNodeExpectedException(nodePos);

    // Resolve "SRC" and keep URL's
    ret.addElement(LinksResolve_KE.resolve(tn.AV("src"), sourcePage));
}

return ret;
-
resolve_KE
public static java.util.Vector<Ret2<java.net.URL,java.net.MalformedURLException>> resolve_KE (java.util.Vector<java.lang.String> src, java.net.URL sourcePage)
Resolve all URL's, represented as String's, inside of a Vector.
KE: - Keep Exceptions
If this method generates a 'MalformedURLException', it will be returned along with the result (not thrown).
Within the Pair-Tuple Ret2<URL, MalformedURLException>, precisely one of the two references will be non-null. If the URL was properly resolved, then the URL field (field Ret2.a) will be non-null. Otherwise, the MalformedURLException field (field Ret2.b) will be non-null.
- Parameters:
src - A list of String's, usually partially or totally completed Internet URL's.
sourcePage - This is the source page URL from which the String's (possibly-relative) URL's in the Vector will be resolved.
- Returns:
- A list of URL's, each of which has been completed/resolved with the 'sourcePage' parameter. If any String is zero-length or null, then null is returned in the related Vector position. If any String causes a MalformedURLException, then that position in the Vector will contain the exception in Ret2.b.
Ret2.a (URL)
This shall contain the fully resolved URL, resolved using the parameter 'sourcePage' as the Base-URL.
Ret2.b (MalformedURLException)
If there were any problems resolving the URL, such that an exception was thrown while producing the resolved URL, the exception will be caught and returned as a reference instead.
- See Also:
resolve_KE(String, URL), Ret2
- Code:
- Exact Method Body:
Vector<Ret2<URL, MalformedURLException>> ret = new Vector<>();

for (String s : src) ret.addElement(LinksResolve_KE.resolve(s, sourcePage));

return ret;
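The "Keep Exceptions" idea can be sketched without this library. The code below is an illustration under stated assumptions, not the body of LinksResolve_KE.resolve, whose source is not shown here. The class KeepExceptionsSketch, the nested Pair type (a stand-in for Ret2), and the helpers resolveAll and parse are all hypothetical names. Resolution is done with the JDK's java.net.URL two-argument constructor, which is deprecated in recent JDK releases but still functions. Null and zero-length inputs are not treated specially in this sketch.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Arrays;
import java.util.Vector;

public class KeepExceptionsSketch
{
    // Hypothetical stand-in for the Ret2<A, B> tuple: two public, final fields.
    static final class Pair<A, B>
    {
        final A a;
        final B b;
        Pair(A a, B b) { this.a = a; this.b = b; }
    }

    // Resolves every string against 'sourcePage'. A failure is stored in field 'b'
    // of that position's Pair instead of being thrown, so the sweep never aborts.
    static Vector<Pair<URL, MalformedURLException>> resolveAll
        (Vector<String> src, URL sourcePage)
    {
        Vector<Pair<URL, MalformedURLException>> ret = new Vector<>();

        for (String s : src)
            try
                { ret.addElement(new Pair<>(new URL(sourcePage, s), null)); }
            catch (MalformedURLException e)
                { ret.addElement(new Pair<>(null, e)); }

        return ret;
    }

    // Builds a URL from a string known to be valid, hiding the checked exception.
    static URL parse(String s)
    {
        try
            { return new URL(s); }
        catch (MalformedURLException e)
            { throw new IllegalArgumentException(e); }
    }

    public static void main(String[] args)
    {
        URL sourcePage = parse("https://example.com/dir/page.html");

        // One relative link, and one link with a protocol the JDK does not know.
        Vector<String> links = new Vector<>(Arrays.asList("a.html", "bogus://x"));

        for (Pair<URL, MalformedURLException> r : resolveAll(links, sourcePage))
            if (r.b != null) System.out.println("Failed:   " + r.b.getMessage());
            else             System.out.println("Resolved: " + r.a);
    }
}
```

Because each position in the returned Vector lines up with the same position in the input, a caller can report exactly which input string failed while still using every link that resolved.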
-
resolve_KE
public static Ret2<java.net.URL,java.net.MalformedURLException> resolve_KE (java.lang.String src, java.net.URL sourcePage)
This will convert a simple Java String to a URL, de-referencing any missing information using the 'sourcePage' parameter.
KE: - Keep Exceptions
If this method generates a 'MalformedURLException', it will be returned along with the result (not thrown).
Within the Pair-Tuple Ret2<URL, MalformedURLException>, precisely one of the two references will be non-null. If the URL was properly resolved, then the URL field (field Ret2.a) will be non-null. Otherwise, the MalformedURLException field (field Ret2.b) will be non-null.
- Parameters:
src - Any Java String, usually one which was scraped from an HTML-Page, and needs to be "completed."
sourcePage - This is the source page URL from which the String (possibly relative) URL will be resolved.
- Returns:
- A URL, which has been completed/resolved with the 'sourcePage' parameter. If parameter 'src' is null or zero-length, null will be returned. If a MalformedURLException is thrown, that will be included with the Ret2<> result.
Ret2.a (URL)
This shall contain the fully resolved URL, resolved using the parameter 'sourcePage' as the Base-URL.
Ret2.b (MalformedURLException)
If there were any problems resolving the URL, such that an exception was thrown while producing the resolved URL, the exception will be caught and returned as a reference instead.
- See Also:
Ret2
- Code:
- Exact Method Body:
return LinksResolve_KE.resolve(src, sourcePage);
-
-