Package Torello.HTML
Class Links
- java.lang.Object
-
- Torello.HTML.Links
-
public class Links extends java.lang.Object
Utilities for de-referencing 'partially-completed' URL's in a Web-Page Vector.
This is a utility class that helps 'complete' URLs that have been scraped from web-pages and are 'relative' (partially completed). Relative URLs are a common occurrence on web-pages, because authors do not need to write out an entire directory and web-server DNS name to retrieve an image file or link that resides in the same directory as the web-page in which that link appears.
CONTENT-NOTE:
These scrape-package classes were initially developed for scraping news-content from the Chinese Government Web-Portal, and redirecting over-seas news-content to a simple translation service for people interested in reading about news from over-seas. This is particularly interesting for a government such as China, where a huge percentage of our economic GDP is based on products exported from factories in the Southern Region there to our strip-malls here in Dallas (and other places). Perhaps these URL examples may not seem relevant to a typical Internet-Programmer who is not presently studying languages, but they are staying here anyway.
Specifically: In addition to Java - Chinese, Spanish, German etc... are also interesting languages to study.
EXCEPTION SUPPRESSION:
Precisely half of these methods are designed to "sweep" an entire page of HTML. The methods that expect a Vector of anchors, images, or other links, and iterate over the entire HTML-Vector or page, will catch any and all exception-throws of type MalformedURLException, and place null in the return-Vector position for that particular URL.
The value of this is, of course, that all links that can be resolved will be resolved. Checking the return-Vector for null-values is necessary when identifying pages that contain broken links or image-sources is important. However, each method whose name ends with the suffix '_KE' shall return a Vector that includes any thrown exception in the Java-HTML Tuple-Class Ret2<URL, MalformedURLException>.
This concept may seem 'unique,' but once this process is familiar - the value of not being forced to write try-catch blocks for every web-page URL-resolution-stage in your programs will hopefully become obvious.
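The exception-suppression idea described above can be sketched with nothing but the JDK. This is a minimal illustration under stated assumptions, not the code of class Links: the class name SweepSketch and the methods resolveQuietly and demo are hypothetical, and plain java.net.URL is used as a stand-in for the library's resolve logic.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Arrays;
import java.util.List;
import java.util.Vector;

public class SweepSketch
{
    // Hypothetical helper (not a method of class Links): resolves each entry against
    // 'sourcePage', storing null wherever a MalformedURLException was thrown.
    @SuppressWarnings("deprecation") // the URL constructors are deprecated in recent JDKs
    public static Vector<URL> resolveQuietly(List<String> srcs, URL sourcePage)
    {
        Vector<URL> ret = new Vector<>();

        for (String src : srcs)
            try
                { ret.add(new URL(sourcePage, src)); }
            catch (MalformedURLException e)
                { ret.add(null); }

        return ret;
    }

    // Sweeps three sample links. The middle one uses the 'tel:' scheme, for which
    // the JDK has no protocol handler, so its slot in the result holds null.
    @SuppressWarnings("deprecation")
    public static Vector<URL> demo()
    {
        try
        {
            URL page = new URL("http://example.com/news/index.html");
            return resolveQuietly(Arrays.asList("pic.png", "tel:2125556789", "/img/a.png"), page);
        }
        catch (MalformedURLException e)
            { throw new IllegalStateException(e); }
    }
}
```

The caller keeps one result slot per input link, so a null at position i always refers to input i, and no try-catch block is needed at the call site.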
EXAMPLES TABLE:
The following table attempts to explain the rules for evaluating relative / partial URL's, such as an HTML '<A ...>' (Anchor-Tag) 'HREF=...' URL, or an <IMG SRC="..."> URL. The column on the left portrays the type of TagNode-input containing a URL - which could be a partial URL - while the column on the right hopefully demystifies how such a URL would be "decoded" (de-referenced) from a partial to a complete Uniform Resource Locator.

Source URL: http://english.gov.CN/article/01-01-2018/index.html
HTML TagNode | Resolved URL
<IMG SRC="http://english.gov.CN/article/01-01-2018/image12345.bmp"> | http://english.gov.CN/article/01-01-2018/image12345.bmp
<IMG SRC="/article/01-01-2018/image12345.bmp"> | http://english.gov.CN/article/01-01-2018/image12345.bmp
<IMG SRC="image12345.bmp"> | http://english.gov.CN/article/01-01-2018/image12345.bmp
<IMG SRC="//some.other.url/a.bmp"> | http://some.other.url/a.bmp
<A HREF="#sub-section"> | null
<IMG SRC="../../pic2.bmp"> | http://english.gov.CN/pic2.bmp
<A HREF="tel: (212) 555-6789"> | null

Source URL: http://english.gov.CN/article/12-31-2018/index.html
HTML TagNode | Resolved URL
<IMG SRC="http://english.gov.CN/article/01-01-2018/image12345.png"> | http://english.gov.CN/article/01-01-2018/image12345.png
<IMG SRC="/article/01-01-2018/image12345.png"> | http://english.gov.CN/article/01-01-2018/image12345.png
<IMG SRC="image12345.png"> | http://english.gov.CN/article/12-31-2018/image12345.png
<IMG SRC="//some.other.url/a.bmp"> | http://some.other.url/a.bmp
<A HREF="#sub-section"> | null
<IMG SRC="../pic3.bmp"> | http://english.gov.CN/article/pic3.bmp
<A HREF="mailto: [email protected]"> | null

Source URL: http://SpanishNewsBoard.com/article/10-12-2018/index.html
HTML TagNode | Resolved URL
<IMG SRC="http://english.gov.CN/article/01-01-2018/image12345.jpg"> | http://english.gov.CN/article/01-01-2018/image12345.jpg
<IMG SRC="/article/01-01-2018/image12345.jpg"> | http://SpanishNewsBoard.com/article/01-01-2018/image12345.jpg
<IMG SRC="image12345.jpg"> | http://SpanishNewsBoard.com/article/10-12-2018/image12345.jpg
<IMG SRC="//some.other.url/a.bmp"> | http://some.other.url/a.bmp
<A HREF="#sub-section"> | null
<IMG SRC="../../../pic3.bmp"> | null
<A HREF="javascript: alert("hello world);"> | null
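Several rows of the table can be reproduced with the JDK alone. The sketch below is an assumption-laden illustration rather than the library's implementation: ResolveSketch and its resolve(String, String) method are hypothetical names, the host example.com is a placeholder, and plain java.net.URL is used. Note that java.net.URL does not return null for fragment-only links such as "#sub-section", so the null rows in the table reflect extra checks made by class Links and are not covered here.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class ResolveSketch
{
    // Hypothetical helper: resolves 'link' against the page it was found on.
    // Returns null if either string cannot be parsed as a URL.
    @SuppressWarnings("deprecation") // the URL constructors are deprecated in recent JDKs
    public static String resolve(String pageURL, String link)
    {
        try
            { return new URL(new URL(pageURL), link).toString(); }
        catch (MalformedURLException e)
            { return null; }
    }

    public static void main(String[] args)
    {
        String page = "http://example.com/article/01-01-2018/index.html";

        String[] links = {
            "http://example.com/article/01-01-2018/image12345.bmp", // already absolute
            "/article/01-01-2018/image12345.bmp",                   // relative to the host root
            "image12345.bmp",                                       // relative to the page directory
            "//some.other.url/a.bmp",                               // inherits only the protocol
            "../../pic2.bmp"                                        // climbs two directories
        };

        for (String link : links)
            System.out.println(link + " => " + resolve(page, link));
    }
}
```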
The following example will find all HTML <A HREF="..."> (anchor-tags), and replace the HREF value it finds with an absolute url-link.
Example:
// This fixes the body of a "web-page news-article" (or any web-site html, so to speak)
// It assures that (after scraping) any original Anchor URL's which contained "relative links"
// become "absolute links" - by completing the URL.

// The original web-site url
URL webSiteURL = new URL("https://some-web-site.com/News/Article-Numero-Uno.html");

// Here the HTML page is downloaded to a simple Java Vector.
Vector<HTMLNode> page = HTMLPage.getPageTokens(webSiteURL, false);

// Any URL's which do not contain complete URI's - inclusive of a domain-name, directory,
// and file-name will be completed and inserted back into the page.
Links.resolveAllHREF(page, webSiteURL, SD.SingleQuotes, false);
COMMON SPECIAL CASES:
Some commonly found HREF-Attribute values are URL-Links that are not intended to point to HTML pages. The following rather commonly found values for HTML Anchor Tag HREF-Attributes will cause this class to return null and/or return an exception:
<A HREF="tel:<a-telephone-number>" ... >
<A HREF="javascript:<some-script-calls>" ... >
<A HREF="mailto:<an-email-address>" ... >
<A HREF="file:<file-for-download>" ... >
<A HREF="ftp:<ftp-file-transfer-protocol-address>" ... >
<A HREF="magnet:<bit-torrent-address>" ...>
<A HREF="data:<base64-encoded-image>" ... >
<A HREF="blob:<Binary-Large-Object>" ... >
<A HREF="#<this-page-subsection>" ... >
Any call to resolve an HTML Anchor element whose URL link begins with one of the above special-cases will return null, or, if the "Keep Exception" (_KE) version is requested, a Torello.Java.Ret2<URL, HREFException> will be returned where the value of ret2.a is null, and the value of ret2.b is an instance of HREFException.
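A prefix test of this kind might look like the sketch below. It is a guess at the general technique, not the library's code: NonUrlHrefSketch and isNonUrlHref are hypothetical names, the prefix list is copied from the special cases listed above, and the trimming and case-insensitive comparison are assumptions.

```java
import java.util.Locale;

public class NonUrlHrefSketch
{
    // Prefixes taken from the special-case list in this documentation.
    private static final String[] PREFIXES = {
        "tel:", "javascript:", "mailto:", "file:", "ftp:", "magnet:", "data:", "blob:", "#"
    };

    // Hypothetical helper: true if 'href' starts with one of the prefixes above.
    // ASSUMPTION: leading white-space is ignored, and the comparison ignores case.
    public static boolean isNonUrlHref(String href)
    {
        String h = href.trim().toLowerCase(Locale.ROOT);

        for (String prefix : PREFIXES)
            if (h.startsWith(prefix)) return true;

        return false;
    }
}
```

A caller could run this check before attempting resolution, and skip (or record null for) any link for which it returns true.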
- See Also:
ReplaceNodes, ReplaceFunction, HTMLPage, InnerTagFind, Ret2
Source File: Torello/HTML/Links.java
File Size: 61,318 Bytes
Line Count: 1,451 '\n' Characters Found
Stateless Class:
This class neither contains any program-state, nor can it be instantiated. The @StaticFunctional Annotation may also be called 'The Spaghetti Report'. Static-Functional classes are, essentially, C-Styled Files, without any constructors or non-static member fields. It is a concept very similar to the Java-Bean's @Stateless Annotation.
- 1 Constructor(s), 1 declared private, zero-argument constructor
- 26 Method(s), 26 declared static
- 1 Field(s), 1 declared static, 1 declared final
-
-
Field Summary
Fields
Modifier and Type | Field
protected static String[] | _NON_URL_HREFS
-
Method Summary
Resolve URL's
Modifier and Type | Method
static URL | resolve(String src, URL sourcePage)
static Vector<URL> | resolve(Vector<String> src, URL sourcePage)

Resolve URL's, but Suppress Exceptions, and Keep Them
Modifier and Type | Method
static Ret2<URL, MalformedURLException> | resolve_KE(String src, URL sourcePage)
static Vector<Ret2<URL, MalformedURLException>> | resolve_KE(Vector<String> src, URL sourcePage)

Resolve HREF-Attribute URL's
Modifier and Type | Method
static URL | resolveHREF(TagNode tnWithHREF, URL sourcePage)
static TagNode | resolveHREFAndUpdate(TagNode tnWithHREF, URL sourcePage)
static Vector<URL> | resolveHREFs(Iterable<TagNode> tnListWithHREF, URL sourcePage)
static Vector<URL> | resolveHREFs(Vector<? extends HTMLNode> html, int[] nodePosArr, URL sourcePage)

Resolve SRC-Attribute URL's
Modifier and Type | Method
static URL | resolveSRC(TagNode tnWithSRC, URL sourcePage)
static TagNode | resolveSRCAndUpdate(TagNode tnWithSRC, URL sourcePage)
static Vector<URL> | resolveSRCs(Iterable<TagNode> tnListWithSRC, URL sourcePage)
static Vector<URL> | resolveSRCs(Vector<? extends HTMLNode> html, int[] nodePosArr, URL sourcePage)

Resolve all HREF URL's on an HTML-Page, and Update the Page-Vector
Modifier and Type | Method
static Ret3<int[],int[],int[]> | resolveAllHREF(Vector<? super TagNode> html, int sPos, int ePos, URL sourcePage, SD quote, boolean askForReturnArraysOrReturnNull)
static Ret3<int[],int[],int[]> | resolveAllHREF(Vector<? super TagNode> html, URL sourcePage, SD quote, boolean askForReturnArraysOrReturnNull)
static Ret3<int[],int[],int[]> | resolveAllHREF(Vector<? super TagNode> html, DotPair dp, URL sourcePage, SD quote, boolean askForReturnArraysOrReturnNull)

Resolve all SRC URL's on an HTML-Page, and Update the Page-Vector
Modifier and Type | Method
static Ret3<int[],int[],int[]> | resolveAllSRC(Vector<? super TagNode> html, int sPos, int ePos, URL sourcePage, SD quote, boolean askForReturnArraysOrReturnNull)
static Ret3<int[],int[],int[]> | resolveAllSRC(Vector<? super TagNode> html, URL sourcePage, SD quote, boolean askForReturnArraysOrReturnNull)
static Ret3<int[],int[],int[]> | resolveAllSRC(Vector<? super TagNode> html, DotPair dp, URL sourcePage, SD quote, boolean askForReturnArraysOrReturnNull)

Resolve HREF URL's, but Suppress Exceptions, and Keep Them
Modifier and Type | Method
static Ret2<URL, MalformedURLException> | resolveHREF_KE(TagNode tnWithHREF, URL sourcePage)
static Vector<Ret2<URL, MalformedURLException>> | resolveHREFs_KE(Iterable<TagNode> tnListWithHREF, URL sourcePage)
static Vector<Ret2<URL, MalformedURLException>> | resolveHREFs_KE(Vector<? extends HTMLNode> html, int[] nodePosArr, URL sourcePage)

Resolve SRC URL's, but Suppress Exceptions, and Keep Them
Modifier and Type | Method
static Ret2<URL, MalformedURLException> | resolveSRC_KE(TagNode tnWithSRC, URL sourcePage)
static Vector<Ret2<URL, MalformedURLException>> | resolveSRCs_KE(Iterable<TagNode> tnListWithSRC, URL sourcePage)
static Vector<Ret2<URL, MalformedURLException>> | resolveSRCs_KE(Vector<? extends HTMLNode> html, int[] nodePosArr, URL sourcePage)

More Methods
Modifier and Type | Method
static URL | getBaseURL(Vector<? extends HTMLNode> page)
static String[] | NON_URL_HREFS()
-
-
-
Field Detail
-
_NON_URL_HREFS
protected static final java.lang.String[] _NON_URL_HREFS
List of documented "starter-strings" that are sometimes used in Anchor URL 'HREF=...' attributes.
- See Also:
NON_URL_HREFS()
- Code:
- Exact Field Declaration Expression:
protected static final String[] _NON_URL_HREFS = { "tel:", "magnet:", "javascript:", "mailto:", "ftp:", "file:", "data:", "blog:", "#" };
-
-
Method Detail
-
NON_URL_HREFS
public static java.lang.String[] NON_URL_HREFS()
This small method just returns the complete list of commonly found Anchor 'HREF' String's that do not actually constitute an HTML 'URL'.
This method actually returns a "clone" of an internally stored String[] Array. This protects the list of potential HTML Anchor-Tag 'HREF' Attributes from being changed, doctored or modified.
- Returns:
- A clone of the String-array '_NON_URL_HREFS'
- See Also:
_NON_URL_HREFS
- Code:
- Exact Method Body:
return _NON_URL_HREFS.clone();
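The reason for returning a clone can be seen in a small stand-alone sketch. The class CloneSketch and its shortened prefix list are hypothetical; only the clone-on-return pattern is taken from the method body above.

```java
public class CloneSketch
{
    // Stand-in for the internally stored array (shortened for illustration).
    private static final String[] PREFIXES = { "tel:", "mailto:", "#" };

    // Returns a copy, so that callers cannot modify the internal array.
    public static String[] prefixes()
    { return PREFIXES.clone(); }

    public static void main(String[] args)
    {
        String[] copy = prefixes();
        copy[0] = "changed";                      // modifies only the caller's copy

        System.out.println(prefixes()[0]);        // the internal array still starts with "tel:"
    }
}
```

Since Java arrays are always mutable, even when the field is declared final, handing out a copy is the usual way to keep a static array effectively read-only.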
-
getBaseURL
public static java.net.URL getBaseURL (java.util.Vector<? extends HTMLNode> page) throws MalformedHTMLException, java.net.MalformedURLException
The methods in this class will not automatically extract any HTML <BASE HREF=URL> definitions that are found on this page. If the user wishes to dereference partial / relative URL definitions that exist on the input page, all the while respecting any <BASE HREF=URL> definitions found on the input page, then this method should be utilized.
- Parameters:
page - This may be any HTML page or partial page. If this page has a valid HTML <BASE HREF=URL>, it will be extracted and returned as an instance of class URL.
- Returns:
- This shall return the URL of the HTML <BASE HREF="http://..."> element found within the input-page parameter 'page'. If the page provided does not contain a BASE URL definition, then null shall be returned.
NOTE: The HTML Specification clearly states that only one URL may be defined using the HTML Element <BASE>. Clearly, due to the browser wars, unspecified / non-deterministic behavior is possible if multiple definitions are provided. For the purposes of this class, if such a situation arises, an exception is thrown.
- Throws:
MalformedHTMLException - If the HTML page provided contains multiple definitions of the element <BASE HREF=URL>, then this exception will throw.
java.net.MalformedURLException - If a <BASE HREF=URL> is found / identified within the input page, but that URL is invalid, then this exception shall throw.
- See Also:
TagNodeFind, Attributes.retrieve(Vector, int[], String)
- Code:
- Exact Method Body:
int[] posArr = TagNodeFind.all(page, TC.OpeningTags, "base");

if (posArr.length == 0) return null;

// NOTE: The cast is all right because 'posArr' only points to TagNode's
// Attributes expects to avoid processing Vector<TextNode>, and Vector<CommentNode>
// Above, there will be nothing in the 'posArr' if either of those was passed.
@SuppressWarnings("unchecked")
String[] urls = Attributes.retrieve((Vector<HTMLNode>) page, posArr, "href");

boolean found = false;
String ret = null;

for (String url : urls)
    if ((url != null) && (url.length() > 0))
        if (found)
            throw new MalformedHTMLException(
                "The page you have provided has multiple <BASE HREF=URL> definitions. " +
                "However, the HTML Specifications state that pages may provide just one " +
                "definition. If you wish to proceed, retrieve the definitions manually " +
                "using class TagNodeFind.all and Attributes.retrieve, as explained in " +
                "the JavaDoc pages for this class."
            );
        else
            { found = true; ret = url; }

return new URL(ret);
-
resolveAllSRC
public static Ret3<int[],int[],int[]> resolveAllSRC (java.util.Vector<? super TagNode> html, java.net.URL sourcePage, SD quote, boolean askForReturnArraysOrReturnNull)
-
resolveAllSRC
public static Ret3<int[],int[],int[]> resolveAllSRC (java.util.Vector<? super TagNode> html, DotPair dp, java.net.URL sourcePage, SD quote, boolean askForReturnArraysOrReturnNull)
-
resolveAllSRC
public static Ret3<int[],int[],int[]> resolveAllSRC (java.util.Vector<? super TagNode> html, int sPos, int ePos, java.net.URL sourcePage, SD quote, boolean askForReturnArraysOrReturnNull)
This method shall resolve all partial URL addresses that are found within TagNode elements having 'SRC=...' attributes. Each instance of TagNode found in the input HTML Vector that has an 'SRC' attribute - if the 'URL' is only partially resolved - shall be updated and replaced with a new TagNode with a fully resolved URL.
HTML's <BASE HREF=...>
When using methods in this class which accept a complete (or partial) HTML Vector (using a parameter such as Vector<HTMLNode>), take care to check whether the page provided has a definition for HTML Element <BASE HREF=URL>.
If the input page has such a definition, none of the methods in this class will actually heed it (at all), and therefore the user must manually invoke the method getBaseURL(Vector) in order to retrieve that URL, and then pass that result to input-parameter sourcePage.
More recently, HTML-Pages are making less use of the <BASE> HTML-Tag.
- Parameters:
html - This may be any Vectorized-HTML Web-Page (or sub-page).
The Variable-Type Wild-Card Expression '? super TagNode' means that a Vector<TagNode> or a Vector<HTMLNode> are both accepted by this parameter. They will not cause an exception throw.
Note that if a Vector<Object> is passed, and there are no instances of class TagNode contained by that Vector, then this method will simply exit gracefully.
sPos - This is the (integer) Vector-index that sets a limit for the left-most Vector-position to inspect/search inside the input Vector-parameter.
This value is considered 'inclusive' meaning that the HTMLNode at this Vector-index will be visited by this method.
NOTE: If this value is negative, or larger than the length of the input-Vector, an exception will be thrown.
ePos - This is the (integer) Vector-index that sets a limit for the right-most Vector-position to inspect/search inside the input Vector-parameter.
This value is considered 'exclusive' meaning that the 'HTMLNode' at this Vector-index will not be visited by this method.
NOTE: If this value is larger than the size of the input Vector-parameter, an exception will throw.
ALSO: Passing a negative value to this parameter, 'ePos', will cause its value to be reset to the size of the input Vector-parameter.
sourcePage - This is the source page URL from which the TagNode's (possibly-relative) URL's in the HTML-Vector will be resolved.
quote - A choice for the quotes to use. In most cases, URL attribute values do not contain quotation-marks. So likely either choice would work just fine, without exceptions.
NOTE: null may be passed to this parameter, and if it is, the original quotation marks found in the TagNode's 'SRC' attribute will be reused. Passing null to this parameter should almost always be the easiest and safest choice.
askForReturnArraysOrReturnNull - This (long-named) parameter is merely here to facilitate retrieving more information from this method - if necessary. When this parameter receives the following values:
- TRUE: Three integer int[] arrays will be returned as listed in the Returns: section of this method's documentation.
- FALSE: This method shall return null.
- Returns:
- If input parameter 'askForReturnArraysOrReturnNull' has been passed FALSE, this method shall return null. Otherwise (if passed TRUE), this method shall return an instance of 'Ret3<int[], int[], int[]>' - three separate integer-arrays describing what was found, and what has occurred.
Keep in mind that though the information might be superfluous, discarding these arrays is easy. They are provided as a matter of convenience for cases where more detailed information is mandatory for ensuring that long lists of HTMLNode's were properly updated.
- Ret3.a (int[])
The first int[] array shall contain a list of the index of every TagNode in the input-Vector parameter's range that contained a non-null HTML 'SRC' Attribute.
- Ret3.b (int[])
The second int[] array will contain an index-list of the indices which contained TagNode's that were replaced by the internal-resolve logic.
- Ret3.c (int[])
The third int[] array will contain an index-list of the indices which contained TagNode's whose 'SRC=...' attribute failed to be resolved by the internal-resolve logic, or caused a QuotesException to throw.
- Throws:
java.lang.IndexOutOfBoundsException - This exception shall be thrown if any of the following are true:
- If 'sPos' is negative, or if sPos is greater-than-or-equal-to the size of the Vector
- If 'ePos' is zero, or greater than the size of the Vector
- If the value of 'sPos' is a larger integer than 'ePos'. If 'ePos' was negative, it is first reset to Vector.size(), before this check is done.
- See Also:
resolve(String, URL), TagNode.AV(String), TagNode.setAV(String, String, SD)
- Code:
- Exact Method Body:
// Retrieve the Vector-location of any TagNode on the page that has
// a "SRC=..." attribute. These are almost always HTML <IMG> elements.
// NOTE: FIND Method's are "READ ONLY" - the Cast will make no difference at run-time.
// The @SuppressWarnings is to overcome the cast of 'html'
@SuppressWarnings("unchecked")
int[] hasSrcPosArr = InnerTagFind.all((Vector<HTMLNode>) html, sPos, ePos, "src");

// Java Stream's are convenient for keeping "Growing Lists" of return values.
// This builder shall keep a list of all URL's that failed to update - for any reason
// **UNLESS** the reason is that the URL was already a fully-resolved, non-partial URL
IntStream.Builder failedUpdate = askForReturnArraysOrReturnNull ? IntStream.builder() : null;

// This stream will keep a list of all URL's that were updated, and whose TagNode's
// were replaced inside the input HTML Vector
IntStream.Builder replaced = askForReturnArraysOrReturnNull ? IntStream.builder() : null;

for (int pos : hasSrcPosArr)
{
    // Get the node at the index
    TagNode tn = (TagNode) html.elementAt(pos);

    // 1) Retrieve the SRC Attribute
    // 2) if it is a partial-URL resolve it
    // 3) Convert to a String
    String oldURL = tn.AV("src");
    URL newURL = resolve(oldURL, sourcePage);

    // Some URL's cannot be resolved, if so, just skip this TagNode.
    // Log the index to the stream (if requested), and continue.
    if (newURL == null)
    { if (askForReturnArraysOrReturnNull) failedUpdate.accept(pos); continue; }

    // If the URL was already a fully-resolved-URL, continue - don't replace the TagNode;
    // No logging needed here, the URL was *already* resolved...
    if (oldURL.length() == newURL.toString().length()) continue;

    // Replace the SRC Attribute in the TagNode. This builds a new instance of TagNode
    // If there is an exception, log the index to the stream (if requested), and continue.
    try
        { tn = tn.setAV("src", newURL.toString(), quote); }
    catch (QuotesException qex)
        { if (askForReturnArraysOrReturnNull) failedUpdate.accept(pos); continue; }

    // Replace the index in the Vector containing the old TagNode with the new one.
    html.setElementAt(tn, pos);

    // The Vector-Index at this position had its old TagNode removed and replaced with a
    // new updated one. Log this to the stream-list so to allow the user to know.
    if (askForReturnArraysOrReturnNull) replaced.accept(pos);
}

return askForReturnArraysOrReturnNull
    ? new Ret3<int[], int[], int[]>
        (hasSrcPosArr, replaced.build().toArray(), failedUpdate.build().toArray())
    : null;
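The 'sPos' / 'ePos' rules listed in the Throws section of this method can be read as the small bounds check sketched below. This is only one interpretation of the documented rules, written for illustration: RangeSketch and effectiveEnd are hypothetical names, and the order of the checks is an assumption rather than the library's actual code.

```java
public class RangeSketch
{
    // Hypothetical helper: validates a half-open range [sPos, ePos) against a Vector
    // of the given size, and returns the effective (exclusive) end position.
    public static int effectiveEnd(int size, int sPos, int ePos)
    {
        if ((sPos < 0) || (sPos >= size))
            throw new IndexOutOfBoundsException("sPos out of range: " + sPos);

        if ((ePos == 0) || (ePos > size))
            throw new IndexOutOfBoundsException("ePos out of range: " + ePos);

        // A negative 'ePos' means "search to the end of the Vector"
        if (ePos < 0) ePos = size;

        if (sPos > ePos)
            throw new IndexOutOfBoundsException("sPos [" + sPos + "] is larger than ePos [" + ePos + "]");

        return ePos;
    }
}
```

With a Vector of size 10, passing (2, -1) would search indices 2 through 9, while passing (2, 5) would search indices 2 through 4.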
-
resolveAllHREF
public static Ret3<int[],int[],int[]> resolveAllHREF (java.util.Vector<? super TagNode> html, java.net.URL sourcePage, SD quote, boolean askForReturnArraysOrReturnNull)
-
resolveAllHREF
public static Ret3<int[],int[],int[]> resolveAllHREF (java.util.Vector<? super TagNode> html, DotPair dp, java.net.URL sourcePage, SD quote, boolean askForReturnArraysOrReturnNull)
-
resolveAllHREF
public static Ret3<int[],int[],int[]> resolveAllHREF (java.util.Vector<? super TagNode> html, int sPos, int ePos, java.net.URL sourcePage, SD quote, boolean askForReturnArraysOrReturnNull)
This method shall resolve all partial URL addresses that are found within TagNode elements having 'HREF=...' attributes. Each instance of TagNode found in the input HTML Vector that has an 'HREF' attribute - if the 'URL' is only partially resolved - shall be updated and replaced with a new TagNode with a fully resolved URL.
HTML's <BASE HREF=...>
When using methods in this class which accept a complete (or partial) HTML Vector (using a parameter such as Vector<HTMLNode>), take care to check whether the page provided has a definition for HTML Element <BASE HREF=URL>.
If the input page has such a definition, none of the methods in this class will actually heed it (at all), and therefore the user must manually invoke the method getBaseURL(Vector) in order to retrieve that URL, and then pass that result to input-parameter sourcePage.
More recently, HTML-Pages are making less use of the <BASE> HTML-Tag.
- Parameters:
html - This may be any Vectorized-HTML Web-Page (or sub-page).
The Variable-Type Wild-Card Expression '? super TagNode' means that a Vector<TagNode> or a Vector<HTMLNode> are both accepted by this parameter. They will not cause an exception throw.
Note that if a Vector<Object> is passed, and there are no instances of class TagNode contained by that Vector, then this method will simply exit gracefully.
sPos - This is the (integer) Vector-index that sets a limit for the left-most Vector-position to inspect/search inside the input Vector-parameter.
This value is considered 'inclusive' meaning that the HTMLNode at this Vector-index will be visited by this method.
NOTE: If this value is negative, or larger than the length of the input-Vector, an exception will be thrown.
ePos - This is the (integer) Vector-index that sets a limit for the right-most Vector-position to inspect/search inside the input Vector-parameter.
This value is considered 'exclusive' meaning that the 'HTMLNode' at this Vector-index will not be visited by this method.
NOTE: If this value is larger than the size of the input Vector-parameter, an exception will throw.
ALSO: Passing a negative value to this parameter, 'ePos', will cause its value to be reset to the size of the input Vector-parameter.
sourcePage - This is the source page URL from which the TagNode's (possibly-relative) URL's in the HTML-Vector will be resolved.
quote - A choice for the quotes to use. In most cases, URL attribute values do not contain quotation-marks. So likely either choice would work just fine, without exceptions.
NOTE: null may be passed to this parameter, and if it is, the original quotation marks found in the TagNode's 'HREF' attribute will be reused. Passing null to this parameter should almost always be the easiest and safest choice.
askForReturnArraysOrReturnNull - This (long-named) parameter is merely here to facilitate retrieving more information from this method - if necessary. When this parameter receives the following values:
- TRUE: Three integer int[] arrays will be returned as listed in the Returns: section of this method's documentation.
- FALSE: This method shall return null.
- Returns:
- If input parameter 'askForReturnArraysOrReturnNull' has been passed FALSE, this method shall return null. Otherwise (if passed TRUE), this method shall return an instance of 'Ret3<int[], int[], int[]>' - three separate integer-arrays describing what was found, and what has occurred.
Keep in mind that though the information might be superfluous, discarding these arrays is easy. They are provided as a matter of convenience for cases where more detailed information is mandatory for ensuring that long lists of HTMLNode's were properly updated.
- Ret3.a (int[])
The first int[] array shall contain a list of the index of every TagNode in the input-Vector parameter's range that contained a non-null HTML 'HREF' Attribute.
- Ret3.b (int[])
The second int[] array will contain an index-list of the indices which contained TagNode's that were replaced by the internal-resolve logic.
- Ret3.c (int[])
The third int[] array will contain an index-list of the indices which contained TagNode's whose 'HREF=...' attribute failed to be resolved by the internal-resolve logic, or caused a QuotesException to throw.
- Throws:
java.lang.IndexOutOfBoundsException - This exception shall be thrown if any of the following are true:
- If 'sPos' is negative, or if sPos is greater-than-or-equal-to the size of the Vector
- If 'ePos' is zero, or greater than the size of the Vector
- If the value of 'sPos' is a larger integer than 'ePos'. If 'ePos' was negative, it is first reset to Vector.size(), before this check is done.
- See Also:
resolve(String, URL), TagNode.AV(String), TagNode.setAV(String, String, SD)
- Code:
- Exact Method Body:
// Retrieve the Vector-location of any TagNode on the page that has
// a "HREF=..." attribute. These are almost always HTML <A> (anchor) elements.
// NOTE: FIND Method's are "READ ONLY" - the Cast will make no difference at run-time.
// The @SuppressWarnings is to overcome the cast of 'html'
@SuppressWarnings("unchecked")
int[] hasHRefPosArr = InnerTagFind.all((Vector<HTMLNode>) html, sPos, ePos, "href");

// Java Stream's are convenient for keeping "Growing Lists" of return values.
// This builder shall keep a list of all URL's that failed to update - for any reason
// **UNLESS** the reason is that the URL was already a fully-resolved, non-partial URL
IntStream.Builder failedUpdate = askForReturnArraysOrReturnNull ? IntStream.builder() : null;

// This stream will keep a list of all URL's that were updated, and whose TagNode's
// were replaced inside the input HTML Vector
IntStream.Builder replaced = askForReturnArraysOrReturnNull ? IntStream.builder() : null;

for (int pos : hasHRefPosArr)
{
    // Get the node at the index
    TagNode tn = (TagNode) html.elementAt(pos);

    // 1) Retrieve the HREF Attribute
    // 2) if it is a partial-URL resolve it
    // 3) Convert to a String
    String oldURL = tn.AV("HREF");
    URL newURL = resolve(oldURL, sourcePage);

    // Some URL's cannot be resolved, if so, just skip this TagNode.
    // Log the index to the stream (if requested), and continue.
    if (newURL == null)
    { if (askForReturnArraysOrReturnNull) failedUpdate.accept(pos); continue; }

    // If the URL was already a fully-resolved-URL, continue - don't replace the TagNode;
    // No logging needed here, the URL was *already* resolved...
    if (oldURL.length() == newURL.toString().length()) continue;

    // Replace the HREF Attribute in the TagNode. This builds a new instance of TagNode
    // If there is an exception, log the index to the stream (if requested), and continue.
    try
        { tn = tn.setAV("href", newURL.toString(), quote); }
    catch (QuotesException qex)
        { if (askForReturnArraysOrReturnNull) failedUpdate.accept(pos); continue; }

    // Replace the index in the Vector containing the old TagNode with the new one.
    html.setElementAt(tn, pos);

    // The Vector-Index at this position had its old TagNode removed and replaced with a
    // new updated one. Log this to the stream-list so to allow the user to know.
    if (askForReturnArraysOrReturnNull) replaced.accept(pos);
}

return askForReturnArraysOrReturnNull
    ? new Ret3<int[], int[], int[]>
        (hasHRefPosArr, replaced.build().toArray(), failedUpdate.build().toArray())
    : null;
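The three-array bookkeeping used by the method body above relies on java.util.stream.IntStream.Builder to grow lists of indices. The stand-alone sketch below shows that pattern in isolation. Everything other than IntStream.Builder is hypothetical: the class IndexLogSketch, the method classify, and the two startsWith tests, which are crude stand-ins for "could not be resolved" and "was a relative link".

```java
import java.util.stream.IntStream;

public class IndexLogSketch
{
    // Hypothetical helper: returns { found, replaced, failed } index arrays.
    // A null entry in 'values' stands for a node that has no such attribute.
    public static int[][] classify(String[] values)
    {
        IntStream.Builder found    = IntStream.builder();
        IntStream.Builder replaced = IntStream.builder();
        IntStream.Builder failed   = IntStream.builder();

        for (int i = 0; i < values.length; i++)
        {
            String v = values[i];
            if (v == null) continue;

            found.accept(i);

            // Stand-in rules, for illustration only
            if (v.startsWith("#"))          failed.accept(i);   // treated as un-resolvable
            else if (! v.startsWith("http")) replaced.accept(i); // treated as relative
        }

        return new int[][]
            { found.build().toArray(), replaced.build().toArray(), failed.build().toArray() };
    }
}
```

An index that appears in the first array but in neither of the other two corresponds to a link that was already absolute, and was therefore left untouched.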
-
resolveHREFAndUpdate
public static TagNode resolveHREFAndUpdate(TagNode tnWithHREF, java.net.URL sourcePage)
-
resolveHREF
public static java.net.URL resolveHREF(TagNode tnWithHREF, java.net.URL sourcePage)
This should be used for TagNode's that contain an 'HREF' inner-tag (attribute).
- Parameters:
tnWithHREF - This may be any HTML Element that contains an 'HREF' attribute.
NOTE: An HTML 'anchor' element (<A HREF=...>) will contain these. Often the URL's found here contain "relative" rather than "absolute" addresses.
sourcePage - This is the source page URL from which the TagNode (possibly-relative) URL will be resolved.
- Returns:
- A complete URL without any missing "presumed data" - such as host/domain or directory. Null is returned if attempting to build the URL generated a MalformedURLException.
SPECIFICALLY: This method shall catch all MalformedURLException's.
- Throws:
HREFException - If the TagNode passed to parameter 'tnWithHREF' does not actually contain an HREF attribute, then this exception shall throw.
- See Also:
resolve(String, URL), TagNode.AV(String)
- Code:
- Exact Method Body:
String href = tnWithHREF.AV("href");

if (href == null) throw new HREFException(
    "The TagNode passed to parameter tnWithHREF does not actually contain an " +
    "HREF attribute."
);

return resolve(href, sourcePage);
-
resolveSRCAndUpdate
public static TagNode resolveSRCAndUpdate(TagNode tnWithSRC, java.net.URL sourcePage)
-
resolveSRC
public static java.net.URL resolveSRC(TagNode tnWithSRC, java.net.URL sourcePage)
This should be used for TagNode's that contain a 'SRC' inner-tag (attribute).
- Parameters:
tnWithSRC - This may be any HTML Element that contains a 'SRC' attribute.
NOTE: An HTML 'image' element (<IMG SRC=...>) will contain these. Often the URL's found here contain "relative" rather than "absolute" addresses.
sourcePage - This is the source page URL from which the TagNode (possibly-relative) URL will be resolved.
- Returns:
- A complete URL without any missing "presumed data" - such as host/domain or directory. Null is returned if attempting to build the URL generated a MalformedURLException.
SPECIFICALLY: This method shall catch all MalformedURLException's.
- Throws:
SRCException - If the TagNode passed to parameter 'tnWithSRC' does not actually contain a SRC attribute, then this exception shall throw.
- See Also:
resolve(String, URL), TagNode.AV(String)
- Code:
- Exact Method Body:
String src = tnWithSRC.AV("src");

if (src == null) throw new SRCException(
    "The TagNode passed to parameter tnWithSRC does not actually contain a " +
    "SRC attribute."
);

return resolve(src, sourcePage);
-
resolveHREFs
public static java.util.Vector<java.net.URL> resolveHREFs (java.lang.Iterable<TagNode> tnListWithHREF, java.net.URL sourcePage)
This should be used for lists ofTagNode's
, each of which contain an'HREF'
inner-tag (attribute).- Parameters:
tnListWithHREF
- This may be any list of HTML Elements, each of which must be instances ofclass TagNode
and all of which must have an 'HREF'
attribute.sourcePage
- This is the source pageURL
from which theTagNode's
(possibly-relative)URL's
in theIterable
will be resolved.- Returns:
- A list of
URL's
, each of which have been completed/resolved with the'sourcePage'
parameter. AnyTagNode
which generated an exception, will result in a null value in theVector
.
SPECIFICALLY: If any of the elements intnListWithHREF
do not contain anHREF
inner-tag, then that position in the return-Vector
will also be null. Note that the primary impetus for returning 'null' rather than throwing an exception is that, when large numbers of links from a web-page are being de-referenced, skipping over "broken URL's" makes for simpler coding. - See Also:
resolve(String, URL)
,TagNode.AV(String)
- Code:
- Exact Method Body:
Vector<URL> ret = new Vector<>();

for (TagNode tn : tnListWithHREF) ret.addElement(resolve(tn.AV("href"), sourcePage));

return ret;
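The null-on-failure convention described above can be illustrated with a small, standard-library-only sketch. It does not use this package: `NullSuppressionSketch`, `resolveAll` and `dropNulls` are hypothetical names, and `java.net.URI` stands in for the library's own resolution logic, so edge cases may differ. The point is only that the result keeps one slot per input, and that the caller decides what to do with the null slots.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

public class NullSuppressionSketch {

    // Hypothetical stand-in for a batch resolver: one output slot per input,
    // with null wherever the link is missing, empty, or cannot become a URL.
    public static List<URL> resolveAll(List<String> links, URI sourcePage) {
        List<URL> ret = new ArrayList<>();
        for (String link : links) {
            if (link == null || link.trim().isEmpty()) { ret.add(null); continue; }
            try {
                ret.add(sourcePage.resolve(link.trim()).toURL());
            } catch (IllegalArgumentException | MalformedURLException e) {
                ret.add(null); // suppress the failure, keep the position
            }
        }
        return ret;
    }

    // Removes the null placeholders when positions no longer matter.
    public static List<URL> dropNulls(List<URL> urls) {
        return urls.stream().filter(Objects::nonNull).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        URI page = URI.create("http://example.com/news/index.html");
        List<URL> all =
            resolveAll(Arrays.asList("about.html", null, "htp://bad.example/x"), page);
        System.out.println(all);
        System.out.println(dropNulls(all));
    }
}
```

Keeping the nulls in place preserves the correspondence between input positions and results; filtering them out is appropriate only once that correspondence is no longer needed.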
-
resolveSRCs
public static java.util.Vector<java.net.URL> resolveSRCs (java.lang.Iterable<TagNode> tnListWithSRC, java.net.URL sourcePage)
This should be used for lists ofTagNode's
, each of which contain a'SRC'
inner-tag (attribute).- Parameters:
tnListWithSRC
- This may be any list of HTML Elements, each of which must be instances ofclass TagNode
and all of which must have a'SRC'
attribute.sourcePage
- This is the source pageURL
from which theTagNode's
(possibly-relative)URL's
in theIterable
will be resolved.- Returns:
- A list of
URL's
, each of which have been completed/resolved with the'sourcePage'
parameter. AnyTagNode
which generated an exception, will result in a null value in theVector.
SPECIFICALLY: If any of the elements intnListWithSRC
do not contain aSRC
inner-tag, then that position in the return-Vector
will also be null. Note that the primary impetus for returning 'null' rather than throwing an exception is that, when large numbers of links from a web-page are being de-referenced, skipping over "broken URL's" makes for simpler coding. - See Also:
resolve(String, URL)
,TagNode.AV(String)
- Code:
- Exact Method Body:
Vector<URL> ret = new Vector<>();

for (TagNode tn : tnListWithSRC) ret.addElement(resolve(tn.AV("src"), sourcePage));

return ret;
-
resolveHREFs
public static java.util.Vector<java.net.URL> resolveHREFs (java.util.Vector<? extends HTMLNode> html, int[] nodePosArr, java.net.URL sourcePage)
This will use a "pointer array" - an array containing indexes into the downloaded page to retrieveTagNode's
. TheTagNode's
to which this pointer-array points - must each contain anHREF
inner-tag with aURL
, or a partialURL
.
HTML's<BASE HREF=...>
Methods in this class which accept a complete (or partial) HTMLVector
(using a parameter such asVector<HTMLNode>
) must take care to check if the page provided has a definition for HTML Element<BASE HREF=URL>
.
If the input page has such a definition, none of the methods in this class will actually heed it (at all), and therefore the user must manually invoke the method getBaseURL(Vector) in order to retrieve thatURL
, and then pass that result to input-parametersourcePage
.
More recently, HTML-Pages are making less use of the <BASE>
HTML-Tag.- Parameters:
html
- This may be any Vectorized-HTML Web-Page (or sub-page).
The Variable-Type Wild-Card Expression'? extends HTMLNode'
means that aVector<TagNode>, Vector<TextNode>
orVector<CommentNode>
will all be accepted by this parameter without causing an exception throw.
These 'sub-type' Vectors are often returned as search results from the classes in the 'NodeSearch'
package.
nodePosArr
- An array of pointers into the page or sub-page. The pointers must referenceTagNode's
that containHREF
attributes. Integer-pointer Arrays are usually returned from thepackage 'NodeSearch'
"Find" methods.
Example:
// Retrieve 'pointers' to all the '<A HREF=...>' TagNode's.  The term 'pointer' refers to
// integer-indices into the vectorized-html variable 'page'
int[] anchorPosArr = TagNodeFind.all(page, TC.OpeningTags, "a");

// Extract each HREF inner-tag, and construct a URL.  Use the 'sourcePage' parameter
// if the URL is only partially-resolved
Vector<URL> urls = Links.resolveHREFs(page, anchorPosArr, mySourcePage);
which would obtain a pointer-array / (a.k.a. a "vector-index-array") to every HTML"<A ...>"
element that was available in the HTML page-Vector
parameter'html'
, and then resolve any shortenedURL's
.sourcePage
- This is the source pageURL
from whence the (possibly relative)TagNode URL's
in theVector
are to be resolved.- Returns:
- A list of
URL's
, each of which have been completed/resolved with the'sourcePage'
parameter. AnyTagNode
which generated an exception, will result in a null value in theVector
. However, if any of the nodes pointed to by the'nodePosArr'
parameter do not contain openingTagNode
elements, then this mistake shall generateTagNodeExpectedException's
.
SPECIFICALLY: If any of the TagNode's referenced by nodePosArr
do not contain an HREF
inner-tag, then that position in the return-Vector
will also be null. Note that the primary impetus for returning 'null' rather than throwing an exception is that, when large numbers of links from a web-page are being de-referenced, skipping over "broken URL's" makes for simpler coding. - Throws:
java.lang.ArrayIndexOutOfBoundsException
- If any of the elements in 'nodePosArr'
contain index-pointers that are out of range of Vector
-parameter 'html'
, then java will, naturally, throw this exception.OpeningTagNodeExpectedException
- When aVector
position-index holds an instance ofTagNode
, but thatTagNode
is one in which itsisClosing
-Field is set toTRUE
, then this exception shall throw.
When passingint[]
-Array parameter 'nodePosArr'
, that array should contain a list ofVector
-indices. The code which checks for this exception checks to ensure that each of the locations in that array point to Opening TagNode's, and if or when they don't, this exception throws.TagNodeExpectedException
- This exception shall throw if an identifiedVector
-index must point-to an instance ofTagNode
, but that index instead holds some otherHTMLNode
instance (eitherCommentNode
orTextNode
). If an integer-position array (int[] nodePosArr
) is passed, but that array has an index pointing-to - something besides aTagNode
- then this exception will be thrown.- See Also:
resolve(String, URL)
,TagNode.AV(String)
- Code:
- Exact Method Body:
// Return Vector
Vector<URL> ret = new Vector<>();

for (int nodePos : nodePosArr)
{
    HTMLNode n = html.elementAt(nodePos);

    // Must be an HTML TagNode
    if (! n.isTagNode()) throw new TagNodeExpectedException(nodePos);

    TagNode tn = (TagNode) n;

    // Must be an "Opening" HTML TagNode
    if (tn.isClosing) throw new OpeningTagNodeExpectedException(nodePos);

    // Resolve the 'HREF', save the URL
    ret.addElement(resolve(tn.AV("href"), sourcePage));
}

return ret;
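The note on `<BASE HREF=...>` above matters because the same relative link resolves to different addresses depending on which URL is used as the context. The following standard-library sketch illustrates only that effect. It does not call this class or `getBaseURL`; the class name `BaseHrefSketch`, the helper `resolveAgainst`, and both example addresses are assumptions made for illustration.

```java
import java.net.URI;

public class BaseHrefSketch {

    // Resolves 'relative' against 'contextUrl' using java.net.URI (RFC 2396 rules).
    public static String resolveAgainst(String contextUrl, String relative) {
        return URI.create(contextUrl).resolve(relative).toString();
    }

    public static void main(String[] args) {
        // The address from which the page was downloaded (hypothetical).
        String pageUrl = "http://example.com/news/today/index.html";

        // The value of a hypothetical <BASE HREF=...> element found on that page.
        String baseHref = "http://static.example.com/assets/";

        // Same relative link, two different contexts, two different results.
        System.out.println(resolveAgainst(pageUrl, "logo.png"));
        System.out.println(resolveAgainst(baseHref, "logo.png"));
    }
}
```

If a page declares a base URL, passing the download address as `sourcePage` would produce the first form, while browsers would use the second; this is why the base URL should be retrieved and passed explicitly.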
-
resolveSRCs
public static java.util.Vector<java.net.URL> resolveSRCs (java.util.Vector<? extends HTMLNode> html, int[] nodePosArr, java.net.URL sourcePage)
This will use a "pointer array" - an array containing indexes into the downloaded page to retrieveTagNode's
. TheTagNode's
to which this pointer-array points - must each contain aSRC
inner-tag with aURL
, or a partialURL
.
HTML's<BASE HREF=...>
Methods in this class which accept a complete (or partial) HTMLVector
(using a parameter such asVector<HTMLNode>
) must take care to check if the page provided has a definition for HTML Element<BASE HREF=URL>
.
If the input page has such a definition, none of the methods in this class will actually heed it (at all), and therefore the user must manually invoke the method getBaseURL(Vector) in order to retrieve thatURL
, and then pass that result to input-parametersourcePage
.
More recently, HTML-Pages are making less use of the <BASE>
HTML-Tag.- Parameters:
html
- This may be any Vectorized-HTML Web-Page (or sub-page).
The Variable-Type Wild-Card Expression'? extends HTMLNode'
means that aVector<TagNode>, Vector<TextNode>
orVector<CommentNode>
will all be accepted by this parameter without causing an exception throw.
These 'sub-type' Vectors are often returned as search results from the classes in the 'NodeSearch'
package.
nodePosArr
- An array of pointers into the page or sub-page. The pointers must referenceTagNode's
that containSRC
attributes. Integer-pointer Arrays are usually returned from thepackage 'NodeSearch'
"Find" methods.
Example:
// Retrieve 'pointers' to all the '<IMG SRC=...>' TagNode's.  The term 'pointer' refers to
// integer-indices into the vectorized-html variable 'page'
int[] picturePosArr = TagNodeFind.all(page, TC.OpeningTags, "img");

// Extract each SRC inner-tag, and construct a URL.  Use the 'sourcePage' parameter
// if the URL is only partially-resolved
Vector<URL> urls = Links.resolveSRCs(page, picturePosArr, mySourcePage);
which would obtain a pointer-array / (a.k.a. a "vector-index-array") to every HTML"<IMG ...>"
element that was available in the HTML page-Vector
parameter'html'
, and then resolve any shortened image URL's
.sourcePage
- This is the source pageURL
from whence the (possibly relative)TagNode URL's
in theVector
are to be resolved.- Returns:
- A list of
URL's
, each of which have been completed/resolved with the'sourcePage'
parameter. AnyTagNode
which generated an exception, will result in a null value in theVector
. However, if any of the nodes pointed to by the'nodePosArr'
parameter do not contain openingTagNode
elements, then this mistake shall generateTagNodeExpectedException's
.
SPECIFICALLY: If any of the TagNode's referenced by nodePosArr
do not contain a SRC
inner-tag, then that position in the return-Vector
will also be null. Note that the primary impetus for returning 'null' rather than throwing an exception is that, when large numbers of links from a web-page are being de-referenced, skipping over "broken URL's" makes for simpler coding. - Throws:
java.lang.ArrayIndexOutOfBoundsException
- If any of the elements in 'nodePosArr'
contain index-pointers that are out of range of Vector
-parameter 'html'
, then java will, naturally, throw this exception.OpeningTagNodeExpectedException
- When aVector
position-index holds an instance ofTagNode
, but thatTagNode
is one in which itsisClosing
-Field is set toTRUE
, then this exception shall throw.
When passingint[]
-Array parameter 'nodePosArr'
, that array should contain a list ofVector
-indices. The code which checks for this exception checks to ensure that each of the locations in that array point to Opening TagNode's, and if or when they don't, this exception throws.TagNodeExpectedException
- This exception shall throw if an identifiedVector
-index must point-to an instance ofTagNode
, but that index instead holds some otherHTMLNode
instance (eitherCommentNode
orTextNode
). If an integer-position array (int[] nodePosArr
) is passed, but that array has an index pointing-to - something besides aTagNode
- then this exception will be thrown.- See Also:
resolve(String, URL)
,TagNode.AV(String)
- Code:
- Exact Method Body:
// Return Vector
Vector<URL> ret = new Vector<>();

for (int nodePos : nodePosArr)
{
    HTMLNode n = html.elementAt(nodePos);

    // Must be an HTML TagNode
    if (! n.isTagNode()) throw new TagNodeExpectedException(nodePos);

    TagNode tn = (TagNode) n;

    // Must be an "Opening" HTML TagNode
    if (tn.isClosing) throw new OpeningTagNodeExpectedException(nodePos);

    // Resolve the "SRC", save the URL
    ret.addElement(resolve(tn.AV("src"), sourcePage));
}

return ret;
-
resolve
public static java.util.Vector<java.net.URL> resolve (java.util.Vector<java.lang.String> src, java.net.URL sourcePage)
This will convert a list of simple javaString's
to a list/Vector
ofURL's
, de-referencing any missing information using the'sourcePage'
parameter.- Parameters:
src
- a list of strings - usually partially or totally completed InternetURL's
sourcePage
- This is the source pageURL
from which theString's
(possibly-relative)URL's
in theVector
will be resolved.- Returns:
- A list of
URL's
, each of which have been completed/resolved with the'sourcePage'
parameter. If there were anyString's
that were zero-length or null, then null is returned in the relatedVector
position. If any String
causes aMalformedURLException
, then that position in theVector
will be null. - See Also:
resolve(String, URL)
- Code:
- Exact Method Body:
Vector<URL> ret = new Vector<>();

for (String s : src) ret.addElement(resolve(s, sourcePage));

return ret;
-
resolve
public static java.net.URL resolve(java.lang.String src, java.net.URL sourcePage)
This will convert a simple javaString
to aURL
, de-referencing any missing information using the'sourcePage'
parameter.- Parameters:
src
- Any javaString
, usually one which was scraped from an HTML-Page, and needs to be "completed."sourcePage
- This is the source pageURL
from which the String (possibly-relative)URL
will be resolved.- Returns:
- A
URL
, which has been completed/resolved with the'sourcePage'
parameter. If parameter'src'
is null or zero-length, then this method will also return null. If aMalformedURLException
is generated, null will also be returned. - Code:
- Exact Method Body:
if (sourcePage == null) throw new NullPointerException(
    "Though you may provide null to the partial-URL to dereference parameter, null " +
    "may not be passed to the Source-Page Parameter. The purpose of the 'resolve' " +
    "operation is to resolve partial-URLs against a source-page (root) URL. " +
    "Therefore this is not allowed."
);

if (src == null) return null;

src = src.trim();

if (src.length() == 0) return null;

String srcLC = src.toLowerCase();

if (StrCmpr.startsWithXOR(srcLC, _NON_URL_HREFS)) return null;

if (srcLC.startsWith("http://") || srcLC.startsWith("https://"))

    try
        { return new URL(src); }
    catch (MalformedURLException e)
        { return null; }

if (src.startsWith("//") && (src.charAt(3) != '/'))

    try
        { return new URL(sourcePage.getProtocol().toLowerCase() + ":" + src); }
    catch (MalformedURLException e)
        { return null; }

if (src.startsWith("/"))

    try
    {
        return new URL(
            sourcePage.getProtocol().toLowerCase() + "://" +
            sourcePage.getHost().toLowerCase() + src
        );
    }
    catch (MalformedURLException e)
        { return null; }

if (src.startsWith("../"))
{
    String sourcePageStr = sourcePage.toString();
    short nLevels = 0;

    do
        { nLevels++; src = src.substring(3); }
    while (src.startsWith("../"));

    String directory = StringParse.dotDotParentDirectory(sourcePage.toString(), nLevels);

    try
        { return new URL(directory + src); }
    catch (Exception e)
        { return null; }
}

String root =
    sourcePage.getProtocol().toLowerCase() + "://" +
    sourcePage.getHost().toLowerCase();

String path = sourcePage.getPath().trim();
int pos = StringParse.findLastFrontSlashPos(path);

if (pos == -1) throw new StringIndexOutOfBoundsException(
    "The URL you have provided: " + sourcePage.toString() + " does not have a '/' " +
    "front-slash character in it's path. Cannot proceed resolving relative-URL's " +
    "without this."
);

path = path.substring(0, pos + 1);

try
    { return new URL(root + path + src); }
catch (MalformedURLException e)
    { return null; }
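The method body above distinguishes five link shapes: absolute (`http://` or `https://`), protocol-relative (`//host/...`), root-relative (`/...`), parent-relative (`../...`), and links relative to the source page's directory. As a rough, standard-library-only illustration of what each shape is expected to resolve to, the sketch below uses `java.net.URI.resolve`. This is not the library's implementation and may differ in edge cases (for example, it has no equivalent of the `_NON_URL_HREFS` filter); the class name `RelativeFormsSketch` and the example addresses are assumptions.

```java
import java.net.URI;

public class RelativeFormsSketch {

    // Resolves 'link' against 'sourcePage' using java.net.URI (RFC 2396 rules).
    public static String resolveSketch(String sourcePage, String link) {
        return URI.create(sourcePage).resolve(link).toString();
    }

    public static void main(String[] args) {
        String page = "https://example.com/a/b/page.html";

        String[] links = {
            "https://other.org/x",       // absolute: returned unchanged
            "//cdn.example.com/lib.js",  // protocol-relative: borrows the scheme
            "/index.html",               // root-relative: borrows scheme and host
            "../img/pic.png",            // parent-relative: climbs one directory
            "next.html"                  // sibling of the source page
        };

        for (String link : links)
            System.out.println(link + "  ->  " + resolveSketch(page, link));
    }
}
```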
-
resolveHREF_KE
public static Ret2<java.net.URL,java.net.MalformedURLException> resolveHREF_KE (TagNode tnWithHREF, java.net.URL sourcePage)
This should be used forTagNode's
that contain an'HREF'
inner-tag (attribute).
KE: - Keep Exceptions
If this method generates a'MalformedURLException'
it will be returned along with the result (not thrown).
Within the Pair-Tuple Ret2<URL, MalformedURLException>
, precisely one of the two references will be non-null. If theURL
was properly resolved, then theURL
field (fieldRet2.a
) will be non-null. Otherwise, theMalformedURLException
field (fieldRet2.b
) will be non-null.- Parameters:
tnWithHREF
- This may be any HTML Element that contains an'HREF'
attribute.
NOTE: An HTML 'anchor' element (<A HREF=...>
) will contain these. Often theURL's
found here contain "relative" rather than "absolute" addresses.sourcePage
- This is the source pageURL
from which theTagNode's
(possibly-relative)URL
will be resolved.- Returns:
- A complete-
URL
without any missing "presumed data" - such as host/domain or directory. If there is no HREF attribute, an HREFException is thrown rather than null being returned. If resolving the URL causes a MalformedURLException, that exception is returned in Ret2.b.
SPECIFICALLY: This method shall catch allMalformedURLException's
.Ret2.a (URL)
This shall contain the fully resolvedURL
- resolved using the parameter'sourcePage'
as the Base-URL
.Ret2.b (MalformedURLException)
If there were any problems resolving theURL
- such that an exception was thrown while producing the resolved-URL
, the exception thrown will be caught and returned as a reference instead.
- Throws:
HREFException
- If theTagNode
passed to parameter'tnWithHREF'
does not actually contain anHREF
attribute, then this exception shall throw.- See Also:
resolve_KE(String, URL)
,TagNode.AV(String)
,Ret2
- Code:
- Exact Method Body:
String href = tnWithHREF.AV("href");

if (href == null) throw new HREFException(
    "The TagNode passed to parameter tnWithHREF does not actually contain an " +
    "HREF attribute."
);

return resolve_KE(href, sourcePage);
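The "Keep Exceptions" idea, returning the exception as data instead of throwing or discarding it, can be sketched without this package. In the example below, `Result` is a hypothetical stand-in for a two-field tuple such as `Ret2` (its real definition is not shown here), and `parseKeepingException` is an illustrative helper, not a method of this class.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

public class KeepExceptionSketch {

    /** Hypothetical two-field holder: exactly one of 'a' and 'b' is non-null. */
    public static final class Result {
        public final URL a;
        public final MalformedURLException b;

        public Result(URL a, MalformedURLException b) { this.a = a; this.b = b; }
    }

    // Builds a URL from an absolute address; a MalformedURLException is caught
    // and handed back in field 'b' instead of propagating to the caller.
    // (A syntactically invalid string would still raise IllegalArgumentException.)
    public static Result parseKeepingException(String absoluteUrl) {
        try {
            return new Result(URI.create(absoluteUrl).toURL(), null);
        } catch (MalformedURLException e) {
            return new Result(null, e);
        }
    }

    public static void main(String[] args) {
        Result ok  = parseKeepingException("http://example.com/x");
        Result bad = parseKeepingException("htp://example.com/x"); // unknown protocol

        System.out.println("ok.a  = " + ok.a);
        System.out.println("bad.b = " + bad.b);
    }
}
```

Compared with returning a bare null, this keeps the reason for the failure available to the caller, at the cost of one extra field check per result.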
-
resolveSRC_KE
public static Ret2<java.net.URL,java.net.MalformedURLException> resolveSRC_KE (TagNode tnWithSRC, java.net.URL sourcePage)
This should be used forTagNode's
that contain a'SRC'
inner-tag (attribute).
KE: - Keep Exceptions
If this method generates a'MalformedURLException'
it will be returned along with the result (not thrown).
Within the Pair-Tuple Ret2<URL, MalformedURLException>
, precisely one of the two references will be non-null. If theURL
was properly resolved, then theURL
field (fieldRet2.a
) will be non-null. Otherwise, theMalformedURLException
field (fieldRet2.b
) will be non-null.- Parameters:
tnWithSRC
- This may be any HTML Element that contains a'SRC'
attribute.
NOTE: An HTML 'image' element (<IMG SRC=...>
) will contain these. Often theURL's
found here contain "relative" rather than "absolute" addresses.sourcePage
- This is the source pageURL
from which theTagNode's
(possibly-relative)URL
will be resolved.- Returns:
- A complete-
URL
without any missing "presumed data" - such as host/domain or directory. If there is no SRC attribute, a SRCException is thrown rather than null being returned. If resolving the URL causes a MalformedURLException, that exception is returned in Ret2.b.
SPECIFICALLY: This method shall catch allMalformedURLException's
.Ret2.a (URL)
This shall contain the fully resolvedURL
- resolved using the parameter'sourcePage'
as the Base-URL
.Ret2.b (MalformedURLException)
If there were any problems resolving theURL
- such that an exception was thrown while producing the resolved-URL
, the exception thrown will be caught and returned as a reference instead.
- Throws:
SRCException
- If theTagNode
passed to parameter'tnWithSRC'
does not actually contain aSRC
attribute, then this exception shall throw.- See Also:
resolve_KE(String, URL)
,TagNode.AV(String)
,Ret2
- Code:
- Exact Method Body:
String src = tnWithSRC.AV("src");

if (src == null) throw new SRCException(
    "The TagNode passed to parameter tnWithSRC does not actually contain a " +
    "SRC attribute."
);

return resolve_KE(src, sourcePage);
-
resolveHREFs_KE
public static java.util.Vector<Ret2<java.net.URL,java.net.MalformedURLException>> resolveHREFs_KE (java.lang.Iterable<TagNode> tnListWithHREF, java.net.URL sourcePage)
This should be used for lists ofTagNode's
, each of which contain an'HREF'
inner-tag (attribute).
KE: - Keep Exceptions
If this method generates a'MalformedURLException'
it will be returned along with the result (not thrown).
Within the Pair-Tuple Ret2<URL, MalformedURLException>
, precisely one of the two references will be non-null. If theURL
was properly resolved, then theURL
field (fieldRet2.a
) will be non-null. Otherwise, theMalformedURLException
field (fieldRet2.b
) will be non-null.- Parameters:
tnListWithHREF
- This may be any list of HTML Elements, each of which must be instances ofclass TagNode
and all of which must have an 'HREF'
attribute.sourcePage
- This is the source pageURL
from which theTagNode's
(possibly-relative)URL's
in theIterable
will be resolved.- Returns:
- A list of
URL's
, each of which have been completed/resolved with the'sourcePage'
parameter. If there were anyTagNode
with noHREF
tag, then null is returned in the relatedVector
position. If anyTagNode
causes aMalformedURLException
, then that position in theVector
will contain the exception in Ret2.b.
SPECIFICALLY: If any of the elements in tnListWithHREF
do not contain an HREF
inner-tag, then that position in the return-Vector
will also be null. Note that the primary impetus for returning 'null' rather than throwing an exception is that, when large numbers of links from a web-page are being de-referenced, skipping over "broken URL's" makes for simpler coding.
Ret2.a (URL)
This shall contain the fully resolvedURL
- resolved using the parameter'sourcePage'
as the Base-URL
.Ret2.b (MalformedURLException)
If there were any problems resolving theURL
- such that an exception was thrown while producing the resolved-URL
, the exception thrown will be caught and returned as a reference instead.
- See Also:
resolve_KE(String, URL)
,TagNode.AV(String)
,Ret2
- Code:
- Exact Method Body:
Vector<Ret2<URL, MalformedURLException>> ret = new Vector<>();

for (TagNode tn : tnListWithHREF)
    ret.addElement(resolve_KE(tn.AV("href"), sourcePage));

return ret;
-
resolveSRCs_KE
public static java.util.Vector<Ret2<java.net.URL,java.net.MalformedURLException>> resolveSRCs_KE (java.lang.Iterable<TagNode> tnListWithSRC, java.net.URL sourcePage)
This should be used for lists ofTagNode's
, each of which contain a'SRC'
inner-tag (attribute).
KE: - Keep Exceptions
If this method generates a'MalformedURLException'
it will be returned along with the result (not thrown).
Within the Pair-Tuple Ret2<URL, MalformedURLException>
, precisely one of the two references will be non-null. If theURL
was properly resolved, then theURL
field (fieldRet2.a
) will be non-null. Otherwise, theMalformedURLException
field (fieldRet2.b
) will be non-null.- Parameters:
tnListWithSRC
- This may be any list of HTML Elements, each of which must be instances ofclass TagNode
and all of which must have a'SRC'
attribute.sourcePage
- This is the source pageURL
from which theTagNode's
(possibly-relative)URL's
in theIterable
will be resolved.- Returns:
- A list of
URL's
, each of which have been completed/resolved with the'sourcePage'
parameter. If there were anyTagNode
with noSRC
tag, then null is returned in the relatedVector
position. If anyTagNode
causes aMalformedURLException
, then that position in theVector
will contain the exception in Ret2.b.
SPECIFICALLY: If any of the elements in tnListWithSRC
do not contain a SRC
inner-tag, then that position in the return-Vector
will also be null. Note that the primary impetus for returning 'null' rather than throwing an exception is that, when large numbers of links from a web-page are being de-referenced, skipping over "broken URL's" makes for simpler coding.
Ret2.a (URL)
This shall contain the fully resolvedURL
- resolved using the parameter'sourcePage'
as the Base-URL
.Ret2.b (MalformedURLException)
If there were any problems resolving theURL
- such that an exception was thrown while producing the resolved-URL
, the exception thrown will be caught and returned as a reference instead.
- See Also:
resolve_KE(String, URL)
,TagNode.AV(String)
,Ret2
- Code:
- Exact Method Body:
Vector<Ret2<URL, MalformedURLException>> ret = new Vector<>();

for (TagNode tn : tnListWithSRC)
    ret.addElement(resolve_KE(tn.AV("src"), sourcePage));

return ret;
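When a whole list is resolved in "Keep Exceptions" style, the caller typically walks the results and separates the failures from the successes. The sketch below shows one way this might look using only the standard library. `Result` is again a hypothetical stand-in for a two-field tuple, `resolveAllKeeping` and `failedPositions` are illustrative helpers, and `java.net.URI` replaces the library's own resolution logic.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class KeepExceptionBatchSketch {

    /** Hypothetical two-field holder: 'a' on success, 'b' on failure. */
    public static final class Result {
        public final URL a;
        public final MalformedURLException b;

        public Result(URL a, MalformedURLException b) { this.a = a; this.b = b; }
    }

    // One Result per input link; a MalformedURLException is stored, never thrown.
    public static List<Result> resolveAllKeeping(List<String> links, URI sourcePage) {
        List<Result> ret = new ArrayList<>();
        for (String link : links) {
            try {
                ret.add(new Result(sourcePage.resolve(link).toURL(), null));
            } catch (MalformedURLException e) {
                ret.add(new Result(null, e));
            }
        }
        return ret;
    }

    // Indices of the inputs whose resolution produced an exception.
    public static List<Integer> failedPositions(List<Result> results) {
        List<Integer> failed = new ArrayList<>();
        for (int i = 0; i < results.size(); i++)
            if (results.get(i).b != null) failed.add(i);
        return failed;
    }

    public static void main(String[] args) {
        URI page = URI.create("http://example.com/img/index.html");
        List<Result> results = resolveAllKeeping(
            Arrays.asList("a.png", "htp://bad.example/b.png", "/c.png"), page);

        for (int i : failedPositions(results))
            System.out.println("Link #" + i + " failed: " + results.get(i).b);
    }
}
```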
-
resolveHREFs_KE
public static java.util.Vector<Ret2<java.net.URL,java.net.MalformedURLException>> resolveHREFs_KE (java.util.Vector<? extends HTMLNode> html, int[] nodePosArr, java.net.URL sourcePage)
This will use a "pointer array" - an array containing indexes into the downloaded page to retrieveTagNode's
. The TagNode's
to which this pointer-array points - must containHREF
inner-tags withURL's
, or partialURL's
.
HTML's<BASE HREF=...>
Methods in this class which accept a complete (or partial) HTMLVector
(using a parameter such asVector<HTMLNode>
) must take care to check if the page provided has a definition for HTML Element<BASE HREF=URL>
.
If the input page has such a definition, none of the methods in this class will actually heed it (at all), and therefore the user must manually invoke the method getBaseURL(Vector) in order to retrieve thatURL
, and then pass that result to input-parametersourcePage
.
More recently, HTML-Pages are making less use of the <BASE>
HTML-Tag.
KE: - Keep Exceptions
If this method generates a'MalformedURLException'
it will be returned along with the result (not thrown).
Within the Pair-Tuple Ret2<URL, MalformedURLException>
, precisely one of the two references will be non-null. If theURL
was properly resolved, then theURL
field (fieldRet2.a
) will be non-null. Otherwise, theMalformedURLException
field (fieldRet2.b
) will be non-null.- Parameters:
html
- This may be any Vectorized-HTML Web-Page (or sub-page).
The Variable-Type Wild-Card Expression'? extends HTMLNode'
means that aVector<TagNode>, Vector<TextNode>
orVector<CommentNode>
will all be accepted by this parameter without causing an exception throw.
These 'sub-type' Vectors are often returned as search results from the classes in the 'NodeSearch'
package.
nodePosArr
- An array of pointers into the page or sub-page. The pointers must referenceTagNode's
that containHREF
attributes. Integer-pointer Arrays are usually returned from the package 'NodeSearch'
"Find" methods.
Example:
// Retrieve 'pointers' to all the '<A HREF=...>' TagNode's.  The term 'pointer' refers to
// integer-indices into the vectorized-html variable 'page'
int[] anchorPosArr = TagNodeFind.all(page, TC.OpeningTags, "a");

// Extract each HREF inner-tag, and construct a URL.  Use the 'sourcePage' parameter if
// the URL is only partially-resolved.  If any URL's on the original-page are invalid, the
// method shall not crash, but save the exception instead.
Vector<Ret2<URL, MalformedURLException>> urlsWithEx =
    Links.resolveHREFs_KE(page, anchorPosArr, mySourcePage);

// Print out any "failed" urls
for (Ret2<URL, MalformedURLException> r : urlsWithEx)
    if (r.b != null)
        System.out.println("There was an exception: " + r.b.toString());
which would obtain a pointer-array / (a.k.a. a "vector-index-array") to every HTML"<A ...>"
element that was available in the HTML page-Vector
parameter'html'
, and then resolve any shortened URL's
.sourcePage
- This is the source pageURL
from which theTagNode's
(possibly-relative)URL's
in theVector
will be resolved.- Returns:
- A list of
URL's
, each of which have been completed/resolved with the'sourcePage'
parameter. If there were anyTagNode
with noHREF
tag, then null is returned in the relatedVector
position. If anyTagNode
causes aMalformedURLException
, then that position in theVector
will contain the exception in Ret2.b.
SPECIFICALLY: If any of the TagNode's referenced by nodePosArr
do not contain an HREF
inner-tag, then that position in the return-Vector
will also be null. Note that the primary impetus for returning 'null' rather than throwing an exception is that, when large numbers of links from a web-page are being de-referenced, skipping over "broken URL's" makes for simpler coding.
Ret2.a (URL)
This shall contain the fully resolvedURL
- resolved using the parameter'sourcePage'
as the Base-URL
.Ret2.b (MalformedURLException)
If there were any problems resolving theURL
- such that an exception was thrown while producing the resolved-URL
, the exception thrown will be caught and returned as a reference instead.
- Throws:
java.lang.ArrayIndexOutOfBoundsException
- If any of the elements in 'nodePosArr'
contain index-pointers that are out of range of Vector
-parameter 'html'
, then java will, naturally, throw this exception.OpeningTagNodeExpectedException
- When aVector
position-index holds an instance ofTagNode
, but thatTagNode
is one in which itsisClosing
-Field is set toTRUE
, then this exception shall throw.
When passingint[]
-Array parameter 'nodePosArr'
, that array should contain a list ofVector
-indices. The code which checks for this exception checks to ensure that each of the locations in that array point to Opening TagNode's, and if or when they don't, this exception throws.TagNodeExpectedException
- This exception shall throw if an identifiedVector
-index must point-to an instance ofTagNode
, but that index instead holds some otherHTMLNode
instance (eitherCommentNode
orTextNode
). If an integer-position array (int[] nodePosArr
) is passed, but that array has an index pointing-to - something besides aTagNode
- then this exception will be thrown.- See Also:
resolve_KE(String, URL)
,TagNode.AV(String)
,Ret2
- Code:
- Exact Method Body:
// Return Vector
Vector<Ret2<URL, MalformedURLException>> ret = new Vector<>();

for (int nodePos : nodePosArr)
{
    HTMLNode n = html.elementAt(nodePos);

    // Must be an HTML TagNode
    if (! n.isTagNode()) throw new TagNodeExpectedException(nodePos);

    TagNode tn = (TagNode) n;

    // Must be an "Opening" HTML TagNode
    if (tn.isClosing) throw new OpeningTagNodeExpectedException(nodePos);

    // Resolve the "HREF", keep the URL
    ret.addElement(resolve_KE(tn.AV("href"), sourcePage));
}

return ret;
-
resolveSRCs_KE
public static java.util.Vector<Ret2<java.net.URL,java.net.MalformedURLException>> resolveSRCs_KE (java.util.Vector<? extends HTMLNode> html, int[] nodePosArr, java.net.URL sourcePage)
This will use a "pointer array" - an array containing indexes into the downloaded page to retrieveTagNode's
. The TagNode's
to which this pointer-array points - must containSRC
inner-tags withURL's
, or partialURL's
.
HTML's<BASE HREF=...>
Methods in this class which accept a complete (or partial) HTMLVector
(using a parameter such asVector<HTMLNode>
) must take care to check if the page provided has a definition for HTML Element<BASE HREF=URL>
.
If the input page has such a definition, none of the methods in this class will actually heed it (at all), and therefore the user must manually invoke the method getBaseURL(Vector) in order to retrieve thatURL
, and then pass that result to input-parametersourcePage
.
More recently, HTML-Pages are making less use of the <BASE>
HTML-Tag.
KE: - Keep Exceptions
If this method generates a'MalformedURLException'
it will be returned along with the result (not thrown).
Within the Pair-Tuple Ret2<URL, MalformedURLException>
, precisely one of the two references will be non-null. If theURL
was properly resolved, then theURL
field (fieldRet2.a
) will be non-null. Otherwise, theMalformedURLException
field (fieldRet2.b
) will be non-null.- Parameters:
html
- This may be any Vectorized-HTML Web-Page (or sub-page).
The Variable-Type Wild-Card Expression'? extends HTMLNode'
means that aVector<TagNode>, Vector<TextNode>
orVector<CommentNode>
will all be accepted by this parameter without causing an exception throw.
These 'sub-type' Vectors are often returned as search results from the classes in the 'NodeSearch'
package.
nodePosArr
- An array of pointers into the page or sub-page. The pointers must referenceTagNode's
that containSRC
attributes. Integer-pointer Arrays are usually returned from the package 'NodeSearch'
"Find" methods.
Example:
// Retrieve 'pointers' to all the '<IMG SRC=...>' TagNode's. The term 'pointer' refers to
// integer-indices into the vectorized-html variable 'page'
int[] picturePosArr = TagNodeFind.all(page, TC.OpeningTags, "img");

// Extract each SRC inner-tag, and construct a URL. Use the 'sourcePage' parameter if
// the URL is only partially-resolved. If any URL's on the original-page are invalid,
// the method shall not crash, but save the exception instead.
Vector<Ret2<URL, MalformedURLException>> urlsWithEx =
    Links.resolveSRCs_KE(page, picturePosArr, mySourcePage);

// Print out any "failed" urls. A Vector position is null when the tag had no SRC.
for (Ret2<URL, MalformedURLException> r : urlsWithEx)
    if ((r != null) && (r.b != null))
        System.out.println("There was an exception: " + r.b.toString());
The example above obtains a pointer-array (a.k.a. a "vector-index-array") to every HTML "<IMG ...>" element available in the HTML page-Vector passed to parameter 'html', and then resolves any shortened URL's.
sourcePage - This is the source page URL against which the TagNode's (possibly-relative) URL's in the Vector will be resolved.
- Returns:
- A list of URL's, each of which has been completed/resolved with the 'sourcePage' parameter. If any TagNode causes a MalformedURLException, then that position in the Vector will contain the exception in Ret2.b.
SPECIFICALLY: If any of the TagNode's referenced by 'nodePosArr' do not contain a SRC inner-tag, the related Vector position will hold null rather than a Ret2 instance. The primary impetus for returning null rather than throwing an exception is that, when large numbers of links from a web-page are being de-referenced, skipping over "broken URL's" makes for simpler coding.
Ret2.a (URL)
This shall contain the fully resolved URL - resolved using the parameter 'sourcePage' as the Base-URL.
Ret2.b (MalformedURLException)
If there were any problems resolving the URL, such that an exception was thrown while producing the resolved URL, the exception thrown will be caught and returned as a reference instead.
- Throws:
java.lang.ArrayIndexOutOfBoundsException - If any of the elements in 'nodePosArr' are index-pointers that are out of range of Vector-parameter 'html', then Java will, naturally, throw this exception.
OpeningTagNodeExpectedException - When a Vector position-index holds an instance of TagNode, but that TagNode is one whose isClosing field is set to TRUE, this exception is thrown.
The int[] array parameter 'nodePosArr' should contain a list of Vector-indices. The method checks that each location in that array points to an Opening TagNode, and throws this exception when one does not.
TagNodeExpectedException - This exception is thrown if a Vector-index that must point to an instance of TagNode instead holds some other HTMLNode instance (either CommentNode or TextNode). If the integer-position array 'nodePosArr' has an index pointing to something besides a TagNode, this exception will be thrown.
- See Also:
resolve_KE(String, URL), TagNode.AV(String), Ret2
- Code:
- Exact Method Body:
// Return Vector
Vector<Ret2<URL, MalformedURLException>> ret = new Vector<>();

for (int nodePos : nodePosArr)
{
    HTMLNode n = html.elementAt(nodePos);

    // Must be an HTML TagNode
    if (! n.isTagNode()) throw new TagNodeExpectedException(nodePos);

    TagNode tn = (TagNode) n;

    // Must be an "Opening" HTML TagNode
    if (tn.isClosing) throw new OpeningTagNodeExpectedException(nodePos);

    // Resolve "SRC" and keep URL's
    ret.addElement(resolve_KE(tn.AV("src"), sourcePage));
}

return ret;
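The "Keep Exceptions" convention used by this method can be illustrated without the library. The following is a minimal sketch under stated assumptions: the nested class UrlOrError is a hypothetical stand-in for Ret2<URL, MalformedURLException>, and resolution is delegated to the standard java.net.URL two-argument constructor rather than to the logic in this package.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Illustrative only: 'UrlOrError' is a hypothetical stand-in for Ret2<URL, MalformedURLException>.
final class KeepExceptionsSketch
{
    static final class UrlOrError
    {
        final URL a;                        // non-null when resolution succeeded
        final MalformedURLException b;      // non-null when resolution failed

        UrlOrError(URL a, MalformedURLException b) { this.a = a; this.b = b; }
    }

    // Resolves 'spec' against 'base'. A MalformedURLException is returned, not thrown.
    // A null or blank 'spec' produces a null result, mirroring the documented null case.
    static UrlOrError resolveKeepingException(String spec, URL base)
    {
        if (spec == null || spec.trim().isEmpty()) return null;

        try
            { return new UrlOrError(new URL(base, spec.trim()), null); }

        catch (MalformedURLException e)
            { return new UrlOrError(null, e); }
    }

    // Convenience wrapper so that callers need not handle the checked exception.
    static URL parse(String s)
    {
        try { return new URL(s); }
        catch (MalformedURLException e) { throw new IllegalArgumentException(e); }
    }

    public static void main(String[] args)
    {
        URL base = parse("https://example.com/news/index.html");

        for (String spec : new String[] { "img/photo.jpg", "bogus://nowhere/x.png", null })
        {
            UrlOrError r = resolveKeepingException(spec, base);

            if (r == null)          System.out.println("skipped (no value)");
            else if (r.b != null)   System.out.println("failed: " + r.b.getMessage());
            else                    System.out.println("resolved: " + r.a);
        }
    }
}
```

The caller checks three cases in order: a null result, a non-null exception field, and finally the URL field.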
-
resolve_KE
public static java.util.Vector<Ret2<java.net.URL,java.net.MalformedURLException>> resolve_KE (java.util.Vector<java.lang.String> src, java.net.URL sourcePage)
Resolve all URL's, represented as String's, inside of a Vector.
KE: - Keep Exceptions
If this method generates a 'MalformedURLException', it will be returned along with the result (not thrown).
Within the Pair-Tuple Ret2<URL, MalformedURLException>, precisely one of the two references will be non-null. If the URL was properly resolved, then the URL field (field Ret2.a) will be non-null. Otherwise, the MalformedURLException field (field Ret2.b) will be non-null.
- Parameters:
src - a list of String's - usually partially or totally completed Internet URL's
sourcePage - This is the source page URL against which the String's (possibly-relative) URL's in the Vector will be resolved.
- Returns:
- A list of URL's, each of which has been completed/resolved with the 'sourcePage' parameter. If any of the String's are zero-length or null, then null is returned in the related Vector position. If any String causes a MalformedURLException, then that position in the Vector will contain the exception in Ret2.b.
Ret2.a (URL)
This shall contain the fully resolved URL - resolved using the parameter 'sourcePage' as the Base-URL.
Ret2.b (MalformedURLException)
If there were any problems resolving the URL, such that an exception was thrown while producing the resolved URL, the exception thrown will be caught and returned as a reference instead.
- See Also:
resolve_KE(String, URL), Ret2
- Code:
- Exact Method Body:
Vector<Ret2<URL, MalformedURLException>> ret = new Vector<>();

for (String s : src) ret.addElement(resolve_KE(s, sourcePage));

return ret;
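Because the returned Vector is position-aligned with the input, each slot falls into one of three cases: resolved, failed, or null. A minimal sketch of sorting a batch into those cases follows. It is an assumption-laden illustration: java.util.Map.Entry stands in for Ret2 (getKey() for Ret2.a, getValue() for Ret2.b), the class and method names are hypothetical, and the standard java.net.URL constructor does the resolving instead of this package.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative only: Map.Entry stands in for Ret2<URL, MalformedURLException>.
final class BatchResolveSketch
{
    static Map.Entry<URL, MalformedURLException> resolveOne(String spec, URL base)
    {
        if (spec == null || spec.trim().isEmpty()) return null;

        try
        {
            return new SimpleImmutableEntry<URL, MalformedURLException>
                (new URL(base, spec.trim()), null);
        }
        catch (MalformedURLException e)
            { return new SimpleImmutableEntry<URL, MalformedURLException>(null, e); }
    }

    // Position i of the returned list corresponds to position i of 'specs'.
    static List<Map.Entry<URL, MalformedURLException>> resolveAll(List<String> specs, URL base)
    {
        List<Map.Entry<URL, MalformedURLException>> ret = new ArrayList<>();
        for (String s : specs) ret.add(resolveOne(s, base));
        return ret;
    }

    // Returns the counts { resolved, failed, skipped } for a batch of results.
    static int[] tally(List<Map.Entry<URL, MalformedURLException>> results)
    {
        int[] counts = new int[3];

        for (Map.Entry<URL, MalformedURLException> r : results)
            if (r == null)                  counts[2]++;    // null or empty input String
            else if (r.getValue() != null)  counts[1]++;    // exception was kept
            else                            counts[0]++;    // URL was resolved

        return counts;
    }

    // Convenience wrapper so that callers need not handle the checked exception.
    static URL parse(String s)
    {
        try { return new URL(s); }
        catch (MalformedURLException e) { throw new IllegalArgumentException(e); }
    }
}
```

Testing for a null slot before reading either field avoids a NullPointerException on pages where some strings are missing or empty.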
-
resolve_KE
public static Ret2<java.net.URL,java.net.MalformedURLException> resolve_KE (java.lang.String src, java.net.URL sourcePage)
This will convert a simple Java String to a URL, de-referencing any missing information using the 'sourcePage' parameter.
KE: - Keep Exceptions
If this method generates a 'MalformedURLException', it will be returned along with the result (not thrown).
Within the Pair-Tuple Ret2<URL, MalformedURLException>, precisely one of the two references will be non-null. If the URL was properly resolved, then the URL field (field Ret2.a) will be non-null. Otherwise, the MalformedURLException field (field Ret2.b) will be non-null.
- Parameters:
src - Any Java String, usually one which was scraped from an HTML-Page, and needs to be "completed."
sourcePage - This is the source page URL against which the String (possibly relative) URL will be resolved.
- Returns:
- A URL, which has been completed/resolved with the 'sourcePage' parameter. If parameter 'src' is null, or is zero-length after trimming, null will be returned in place of a Ret2 instance. If a MalformedURLException is thrown, that will be included with the Ret2<> result.
Ret2.a (URL)
This shall contain the fully resolved URL - resolved using the parameter 'sourcePage' as the Base-URL.
Ret2.b (MalformedURLException)
If there were any problems resolving the URL, such that an exception was thrown while producing the resolved URL, the exception thrown will be caught and returned as a reference instead.
- See Also:
Ret2
- Code:
- Exact Method Body:
if (sourcePage == null) throw new NullPointerException(
    "Though you may provide null to the partial-URL to dereference parameter, null " +
    "may not be passed to the Source-Page Parameter. The purpose of the 'resolve' " +
    "operation is to resolve partial-URLs against a source-page (root) URL. " +
    "Therefore this is not allowed."
);

if (src == null) return null;

src = src.trim();

if (src.length() == 0) return null;

String srcLC = src.toLowerCase();

if (StrCmpr.startsWithXOR
    (srcLC, "tel:", "javascript:", "mailto:", "magnet:", "file:", "ftp:", "#"))

    return new Ret2<URL, MalformedURLException>
        (null, new MalformedURLException(
            "InnerTag/Attribute begins with: " + src.substring(0, 1 + src.indexOf(":")) +
            ", so it is not a hyper-link."
        ));

// Includes the first few characters of the URL - for reporting/convenience.
// If this is an "image", the image-type & name will be included
if (StrCmpr.startsWithXOR(srcLC, "data:", "blob:"))

    return new Ret2<URL, MalformedURLException>(null, new MalformedURLException(
        "InnerTag/Attribute begins with: " +
        ((src.length() > 25) ? src.substring(0, 25) : src) +
        ", not a URL."
    ));

if (srcLC.startsWith("http://") || srcLC.startsWith("https://"))

    try
        { return new Ret2<URL, MalformedURLException>(new URL(src), null); }

    catch (MalformedURLException e)
        { return new Ret2<URL, MalformedURLException>(null, e); }

if (src.startsWith("//") && (src.charAt(3) != '/'))

    try
    {
        return new Ret2<URL, MalformedURLException>
            (new URL( sourcePage.getProtocol().toLowerCase() + ":" + src), null);
    }

    catch (MalformedURLException e)
        { return new Ret2<URL, MalformedURLException>(null, e); }

if (src.startsWith("/"))

    try
    {
        return new Ret2<URL, MalformedURLException>(new URL(
            sourcePage.getProtocol().toLowerCase() + "://" +
            sourcePage.getHost().toLowerCase() +
            src), null
        );
    }

    catch (MalformedURLException e)
        { return new Ret2<URL, MalformedURLException>(null, e); }

if (src.startsWith("../"))
{
    String sourcePageStr = sourcePage.toString();
    short nLevels = 0;

    do
        { nLevels++; src = src.substring(3); }
    while (src.startsWith("../"));

    String directory = StringParse.dotDotParentDirectory(sourcePage.toString(), nLevels);

    try
        { return new Ret2<URL, MalformedURLException>(new URL(directory + src), null); }

    catch (MalformedURLException e)
        { return new Ret2<URL, MalformedURLException>(null, e); }

    catch (Exception e)
    {
        return new Ret2<URL, MalformedURLException>
            (null, new MalformedURLException(e.getClass().getCanonicalName() + ":" + e.getMessage())
        );
    }
}

String root =
    sourcePage.getProtocol().toLowerCase() + "://" +
    sourcePage.getHost().toLowerCase();

String path = sourcePage.getPath().trim();
int pos = StringParse.findLastFrontSlashPos(path);

if (pos == -1) throw new StringIndexOutOfBoundsException(
    "The URL you have provided: " + sourcePage.toString() + " does not have a '/' " +
    "front-slash character in it's path." +
    "Cannot proceed resolving relative-URL's without this."
);

path = path.substring(0, pos + 1);

try
    { return new Ret2<URL, MalformedURLException>(new URL(root + path + src), null); }

catch (MalformedURLException e)
    { return new Ret2<URL, MalformedURLException>(null, e); }
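The method body above branches on the form of 'src': absolute, protocol-relative ("//"), root-relative ("/"), parent-relative ("../"), and plain relative. As a point of comparison only, the sketch below runs one example of each form through the standard java.net.URI.resolve method. It does not reproduce the logic of resolve_KE, and results from this package may differ in details. The class name RelativeFormsSketch and the example.com addresses are placeholders.

```java
import java.net.URI;

// Illustrative only: uses the JDK's own resolver, not the logic of Links.resolve_KE.
final class RelativeFormsSketch
{
    // Resolves 'src' against 'sourcePage' using standard URI reference resolution.
    static String resolve(String sourcePage, String src)
    { return URI.create(sourcePage).resolve(src.trim()).toString(); }

    public static void main(String[] args)
    {
        String page = "http://example.com/a/b/page.html";

        String[] forms =
        {
            "http://other.org/x.html",      // absolute           -> http://other.org/x.html
            "//cdn.example.com/lib.js",     // protocol-relative  -> http://cdn.example.com/lib.js
            "/root.css",                    // root-relative      -> http://example.com/root.css
            "../img/logo.png",              // parent-relative    -> http://example.com/a/img/logo.png
            "pic.jpg"                       // plain relative     -> http://example.com/a/b/pic.jpg
        };

        for (String src : forms) System.out.println(src + " -> " + resolve(page, src));
    }
}
```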
-
-