Package Torello.Java
Class URLs
- java.lang.Object
-
- Torello.Java.URLs
-
public class URLs extends java.lang.Object
A class that plays-with URL's, no more, no less.
This provides a few utility functions for dealing with URL's.
NOTE: This class does not perform relative / absolute URL resolution. URL resolution (completing a partial-URL using the complete-URL of the page on which the link sits) can be performed using the class Links found in the HTML package. This class helps analyze URL's, just a tad, by escaping certain characters found inside a Uniform Resource Locator so that it may connect to a web-server, AJAX Server, JSON retriever, etc.
SUMMARY: This is an "existential" or "experimental" collection of attempts.
Stateless Class: This class neither contains any program-state, nor can it be instantiated. The @StaticFunctional Annotation may also be called 'The Spaghetti Report'. Static-Functional classes are, essentially, C-Styled Files, without any constructors or non-static member fields. It is a concept very similar to the Java-Bean's @Stateless Annotation.
- 1 Constructor(s), 1 declared private, zero-argument constructor
- 15 Method(s), 15 declared static
- 6 Field(s), 6 declared static, 6 declared final
-
-
Field Summary
Fields
Modifier and Type            Field
protected static Pattern     P1
protected static String      RE1
protected static char[]      URL_ESC_CHARS
protected static char[]      URL_ESC_CHARS_ABBREV
protected static char[]      VOWELS
protected static String[]    VOWELS_URL
-
Method Summary
Fun with URL Escape
Modifier and Type       Method
static String           toProperURLV1(String url)
static String           toProperURLV2(String url)
static String           toProperURLV3(String url)
static String           toProperURLV4(String url)
static String           toProperURLV5(String url)
static String           toProperURLV6(String url)
static String           toProperURLV7(String url)
static String           toProperURLV8(URL url)

Remove Partial URLs
Modifier and Type       Method
static URL              shortenPoundREF(URL url)
static int              shortenPoundREFs(Vector<URL> urls, boolean ifExceptionSetNull)
static Ret2<Integer, Vector<MalformedURLException>>
                        shortenPoundREFs_KE(Vector<URL> urls, boolean ifExceptionSetNull)

Remove Duplicate URL's from a List
Modifier and Type       Method
static int              removeDuplicates(Vector<URL> urls)
static int              removeDuplicates(Vector<URL> visitedURLs, Vector<URL> potentiallyNewURLs)

URL to String
Modifier and Type       Method
static String           urlToString(URL url)

Java URL Object Help
Modifier and Type       Method
protected static void   javaURLHelpMessage(StorageWriter sw)
-
-
-
Field Detail
-
RE1
protected static final java.lang.String RE1
This is a Regular-Expression Pattern (java.util.regex.Pattern), saved as a String. It is subsequently compiled.
The primary function is to match String's that are intended to be HTTP-URL's.
IT MATCHES:
http(s)://...<any-text>.../
http(s)://...<any-text, not front-slash>...
http(s)://...<any-text>.../...<any-text, not front-slash>...
This is primarily used, through the compiled Pattern P1, in methods: toProperURLV3(...) and toProperURLV4(...)
- See Also:
- Constant Field Values
- Code:
- Exact Field Declaration Expression:
protected static final String RE1 = "^(http[s]?:\\/\\/.*?\\/$|http[s]?:\\/\\/[^\\/]*$|http[s]?:\\/\\/.*?\\/[^\\/]+)";
-
P1
protected static final java.util.regex.Pattern P1
P1 = Pattern.compile(RE1);
- Code:
- Exact Field Declaration Expression:
protected static final Pattern P1 = Pattern.compile(RE1);
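The following is a minimal sketch of how the three alternatives of RE1 behave, assuming the expression is copied verbatim from the field declaration listed above. The class name Re1Demo, the helper matchedPrefix and the example.com addresses are hypothetical and are not part of this package.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Re1Demo {
    // Copied verbatim from the RE1 field declaration listed above.
    static final String RE1 =
        "^(http[s]?:\\/\\/.*?\\/$|http[s]?:\\/\\/[^\\/]*$|http[s]?:\\/\\/.*?\\/[^\\/]+)";

    static final Pattern P1 = Pattern.compile(RE1);

    // Hypothetical helper: returns the leading text captured by group 1,
    // or null when the input does not start with http:// or https://
    static String matchedPrefix(String url) {
        Matcher m = P1.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "https://example.com",          // second alternative: no slash after the host
            "https://example.com/",         // first alternative: ends with a slash
            "https://example.com/a b/c d",  // third alternative
            "ftp://example.com"             // no match
        };

        for (String s : samples)
            System.out.println(s + "  ->  " + matchedPrefix(s));
    }
}
```

Under this reading of the expression, the third alternative also consumes the first path segment after the host (the captured text for the third sample is "https://example.com/a b"), which is worth keeping in mind when reading the descriptions of toProperURLV3 and toProperURLV4.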
-
VOWELS
protected static final char[] VOWELS
When scraping Spanish URL's, these characters can / should be escaped.
Parallel Array Note: This array shall be considered parallel to the Replacement String[] Array VOWELS_URL.
- See Also:
toProperURLV1(String)
- Code:
- Exact Field Declaration Expression:
protected static final char[] VOWELS = { 'á', 'É', 'é', 'Í', 'í', 'Ó', 'ó', 'Ú', 'ú', 'Ü', 'ü', 'Ñ', 'ñ', 'Ý', 'ý', '¿', '¡' };
-
VOWELS_URL
protected static final java.lang.String[] VOWELS_URL
When scraping Spanish URL's, these String's are the URL Escape Sequences for the Spanish Vowel Characters listed in parallel array VOWELS.
- See Also:
toProperURLV1(String)
- Code:
- Exact Field Declaration Expression:
protected static final String[] VOWELS_URL = { "%C3%A1", "%C3%89", "%C3%A9", "%C3%8D", "%C3%AD", "%C3%93", "%C3%B3", "%C3%9A", "%C3%BA", "%C3%9C", "%C3%BC", "%C3%91", "%C3%B1", "%C3%9D", "%C3%BD", "%C2%BF", "%C2%A1" };
-
URL_ESC_CHARS
protected static final char[] URL_ESC_CHARS
This list of java char's contains characters that are better off escaped when passing them through a URL.
- See Also:
toProperURLV2(String)
- Code:
- Exact Field Declaration Expression:
protected static final char[] URL_ESC_CHARS = { '%', ' ', '#', '$', '&', '@', '`', '/', ':', ';', '<', '=', '>', '?', '[', '\\', ']', '^', '{', '|', '}', '~', '\'', '+', ',' };
-
URL_ESC_CHARS_ABBREV
protected static final char[] URL_ESC_CHARS_ABBREV
This is a (shortened) list of characters that should be escaped before being used within a URL.
NOTE: This version does not have the '&' (ampersand) or the '?' (question-mark) or the '/' (forward-slash).
- See Also:
URL_ESC_CHARS, toProperURLV4(String)
- Code:
- Exact Field Declaration Expression:
protected static final char[] URL_ESC_CHARS_ABBREV = { '%', ' ', '#', '$', '@', '`', ':', ';', '<', '=', '>', '[', '\\', ']', '^', '{', '|', '}', '~', '\'', '+', ',' };
-
-
Method Detail
-
javaURLHelpMessage
protected static final void javaURLHelpMessage(StorageWriter sw) throws java.io.IOException
Java Help Message explaining class java.net.URL and the specific output of its methods. This will just print a 'friendly-reminder' to the terminal-console showing what the output of the methods in class java.net.URL actually is. This helps when breaking-up / resolving partial URL links and partial Image-URL links. This is a "Java-Doc StackOverflow.com-like Documentation/Comment." Generally, dealing with URL's and web-servers can be A LOT more difficult if any European / Spanish accent characters are involved, or if the Asian Character sets are involved. There is an EXTREMELY standardized way to encode characters in just about any language in the world (and the name of that "way" is UTF-8), although, unfortunately, different web-servers expect different types of "escape sequences."
NOTE: The following output was generated when scraping the City of Dallas web-server collecting E-Mail addresses for the E-Mail Distribution list regarding Human-Rights abuses (Hypno-Programming) in this city. Programmers would not be obligated to write their City-Councilman or their Congressman to use any of the material in this scrape package. However, if you are concerned about the abuses of power in the "former" United States, scraping government-websites to collect individuals' e-mail addresses is very easy using this package.

java.net.URL Method()   String-Result

u.toString()            https://DALLASCITYHALL.com
u.getProtocol()         https
u.getHost()             DALLASCITYHALL.com
u.getPath()
u.getFile()
u.getQuery()            null
u.getRef()              null
u.getAuthority()        DALLASCITYHALL.com
u.getUserInfo()         null
urlToString(u)          https://dallascityhall.com

u.toString()            https://dallascityhall.com/
u.getProtocol()         https
u.getHost()             dallascityhall.com
u.getPath()             /
u.getFile()             /
u.getQuery()            null
u.getRef()              null
u.getAuthority()        dallascityhall.com
u.getUserInfo()         null
urlToString(u)          https://dallascityhall.com/

u.toString()            https://dallascityhall.com/news
u.getProtocol()         https
u.getHost()             dallascityhall.com
u.getPath()             /news
u.getFile()             /news
u.getQuery()            null
u.getRef()              null
u.getAuthority()        dallascityhall.com
u.getUserInfo()         null
urlToString(u)          https://dallascityhall.com/news

u.toString()            https://dallascityhall.com/news/
u.getProtocol()         https
u.getHost()             dallascityhall.com
u.getPath()             /news/
u.getFile()             /news/
u.getQuery()            null
u.getRef()              null
u.getAuthority()        dallascityhall.com
u.getUserInfo()         null
urlToString(u)          https://dallascityhall.com/news/

u.toString()            http://DALLASCITYHALL.com/news/ARTICLE-1.html
u.getProtocol()         http
u.getHost()             DALLASCITYHALL.com
u.getPath()             /news/ARTICLE-1.html
u.getFile()             /news/ARTICLE-1.html
u.getQuery()            null
u.getRef()              null
u.getAuthority()        DALLASCITYHALL.com
u.getUserInfo()         null
urlToString(u)          http://dallascityhall.com/news/ARTICLE-1.html

u.toString()            https://DallasCityHall.com/NEWS/article1.html?q=somevalue
u.getProtocol()         https
u.getHost()             DallasCityHall.com
u.getPath()             /NEWS/article1.html
u.getFile()             /NEWS/article1.html?q=somevalue
u.getQuery()            q=somevalue
u.getRef()              null
u.getAuthority()        DallasCityHall.com
u.getUserInfo()         null
urlToString(u)          https://dallascityhall.com/NEWS/article1.html?q=somevalue

u.toString()            https://DallasCityHall.com/news/ARTICLE-1.html#subpart1
u.getProtocol()         https
u.getHost()             DallasCityHall.com
u.getPath()             /news/ARTICLE-1.html
u.getFile()             /news/ARTICLE-1.html
u.getQuery()            null
u.getRef()              subpart1
u.getAuthority()        DallasCityHall.com
u.getUserInfo()         null
urlToString(u)          https://dallascityhall.com/news/ARTICLE-1.html#subpart1

u.toString()            https://DallasCityHall.com/NEWS/article1.html?q=somevalue&q2=someOtherValue
u.getProtocol()         https
u.getHost()             DallasCityHall.com
u.getPath()             /NEWS/article1.html
u.getFile()             /NEWS/article1.html?q=somevalue&q2=someOtherValue
u.getQuery()            q=somevalue&q2=someOtherValue
u.getRef()              null
u.getAuthority()        DallasCityHall.com
u.getUserInfo()         null
urlToString(u)          https://dallascityhall.com/NEWS/article1.html?q=somevalue&q2=someOtherValue

u.toString()            https://DallasCityHall.com/NEWS/article1.html?q=somevalue&q2=someOtherValue#LocalRef
u.getProtocol()         https
u.getHost()             DallasCityHall.com
u.getPath()             /NEWS/article1.html
u.getFile()             /NEWS/article1.html?q=somevalue&q2=someOtherValue
u.getQuery()            q=somevalue&q2=someOtherValue
u.getRef()              LocalRef
u.getAuthority()        DallasCityHall.com
u.getUserInfo()         null
urlToString(u)          https://dallascityhall.com/NEWS/article1.html?q=somevalue&q2=someOtherValue#LocalRef

- Parameters:
sw
- An instance of class StorageWriter. This parameter may be null, and if it is, text-output will be sent to Standard Output.
- Throws:
java.io.IOException
- Code:
- Exact Method Body:
// StorageWriter sw = new StorageWriter();
if (sw == null) sw = new StorageWriter();

String[] urlStrArr = {
    "https://DALLASCITYHALL.com",
    "https://dallascityhall.com/",
    "https://dallascityhall.com/news",
    "https://dallascityhall.com/news/",
    "http://DALLASCITYHALL.com/news/ARTICLE-1.html",
    "https://DallasCityHall.com/NEWS/article1.html?q=somevalue",
    "https://DallasCityHall.com/news/ARTICLE-1.html#subpart1",
    "https://DallasCityHall.com/NEWS/article1.html?q=somevalue&q2=someOtherValue",
    "https://DallasCityHall.com/NEWS/article1.html?q=somevalue&q2=someOtherValue#LocalRef"
};

URL[] urlArr = new URL[urlStrArr.length];

try
{
    for (int i=0; i < urlStrArr.length; i++) urlArr[i] = new URL(urlStrArr[i]);
}
catch (Exception e)
{
    sw.println(
        "Broke a URL, and it generated an exception.\n" +
        "Sorry, fix the URL's in this method.\n" +
        "Did you change them?\n"
    );
    e.printStackTrace();
    return;
}

for (URL u : urlArr)
{
    System.out.println(
        "u.toString():\t\t" + BCYAN + u.toString() + RESET + '\n' +
        "u.getProtocol():\t" + u.getProtocol() + '\n' +
        "u.getHost():\t\t" + u.getHost() + '\n' +
        "u.getPath():\t\t" + u.getPath() + '\n' +
        "u.getFile():\t\t" + u.getFile() + '\n' +
        "u.getQuery():\t\t" + u.getQuery() + '\n' +
        "u.getRef():\t\t" + u.getRef() + '\n' +
        "u.getAuthority():\t" + u.getAuthority() + '\n' +
        "u.getUserInfo():\t" + u.getUserInfo() + '\n' +
        "urlToString(u):\t\t" + urlToString(u)
    );
}

// FileRW.writeFile(C.toHTML(sw.getString()), "URLs.html");
-
toProperURLV1
public static java.lang.String toProperURLV1(java.lang.String url)
This will substitute many of the Spanish-characters that can make a web-query difficult. These are the substitutions listed:

Spanish Language Character    URL Escape Sequence
Á                             %C3%81
á                             %C3%A1
É                             %C3%89
é                             %C3%A9
Í                             %C3%8D
í                             %C3%AD
Ó                             %C3%93
ó                             %C3%B3
Ú                             %C3%9A
ú                             %C3%BA
Ü                             %C3%9C
ü                             %C3%BC
Ñ                             %C3%91
ñ                             %C3%B1
Ý                             %C3%9D
ý                             %C3%BD

NOTE: The VOWELS declaration listed in the Field Detail section does not contain 'Á', and it additionally contains '¿' and '¡'.
NOTE: This was the first time that a URL needed to be encoded when writing the Java-HTML Scrape Package.
- Parameters:
url
- Any website URL query.
- Returns:
- The same URL with substitutions made.
- See Also:
VOWELS, VOWELS_URL, StrReplace.r(String, char[], String[])
- Code:
- Exact Method Body:
return StrReplace.r(url, VOWELS, VOWELS_URL);
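As a rough illustration of the parallel-array substitution described above, here is a minimal sketch that does not use StrReplace.r. The class VowelEscapeSketch, its abbreviated three-character arrays and the example.com address are hypothetical. It also checks that the listed escape sequences are simply the percent-encoded UTF-8 bytes of each character, which the JDK's java.net.URLEncoder produces as well.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class VowelEscapeSketch {
    // Hypothetical, abbreviated versions of the parallel arrays VOWELS / VOWELS_URL.
    // Unicode escapes are used so the file compiles regardless of source encoding.
    static final char[]   CHARS   = { '\u00e1', '\u00e9', '\u00f1' };   // a-acute, e-acute, n-tilde
    static final String[] ESCAPES = { "%C3%A1", "%C3%A9", "%C3%B1" };

    // Replaces each character found in CHARS with the String at the same index in ESCAPES
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();

        outer:
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            for (int k = 0; k < CHARS.length; k++)
                if (CHARS[k] == c) { sb.append(ESCAPES[k]); continue outer; }
            sb.append(c);
        }

        return sb.toString();
    }

    // Percent-encodes a single character as UTF-8 using the JDK
    static String utf8Escape(char c) {
        try { return URLEncoder.encode(String.valueOf(c), "UTF-8"); }
        catch (UnsupportedEncodingException e) { throw new IllegalStateException(e); }
    }

    public static void main(String[] args) {
        System.out.println(escape("http://example.com/ni\u00f1o"));
        System.out.println(utf8Escape('\u00f1'));
    }
}
```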
-
toProperURLV2
public static java.lang.String toProperURLV2(java.lang.String url)
This will clobber the initial http://domain.name.something/ - so it is best to use this on String-Tokens/Literals that are going to be inserted "after the ampersand" or maybe "after the question-mark." When generating arguments to be passed via "JSON" (or whatever), or when passing arguments to GET/POST, this may be used to escape the characters inside the parameters, rather than the entire URL itself.
IN JAVA the following 2 characters need to be escaped:
\ "
IN REGULAR-EXPRESSIONS the following characters need to be escaped:
+ * ? ^ $ \ . [ ] ( ) | / { }
IN HTTP-URL'S it helps to escape these:
# $ % & @ ` / : ; < = > ? [ \ ] ^ | ~ " ' + , { }
NOTE: This is an earlier 'version' of URL-Escaping that came up, and is used in one part of this HTML Search and Scrape Package. It is kept here for legacy reasons, although URL Encoder Version #5 and #6 are likely the most intelligent URL Escape & Encoding methods to use. In both of them, the URL Host and the Protocol-String (a.k.a. "http" or "https") are left alone completely, while the file & directory paths are the only String's whose UTF-8 characters are actually escaped.
- Parameters:
url
- Any information that is intended to be sent via GET or POST
- Returns:
- An escaped version of this URL
- See Also:
URL_ESC_CHARS, StrReplace.r(String, char[], IntCharFunction)
- Code:
- Exact Method Body:
return StrReplace.r(url, URL_ESC_CHARS, (int i, char c) -> '%' + Integer.toHexString((int) c));
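The lambda above turns each listed character into '%' followed by its hexadecimal code. A minimal stand-alone sketch of that idea follows. The class HexEscapeSketch and its short character list are hypothetical, and StrReplace.r is replaced here by a plain loop.

```java
public class HexEscapeSketch {
    // Hypothetical, abbreviated stand-in for URL_ESC_CHARS
    static final char[] ESC = { '%', ' ', '#', '&', '?', '{' };

    static boolean mustEscape(char c) {
        for (char e : ESC) if (e == c) return true;
        return false;
    }

    // Single pass over the input: a '%' that was already present is escaped exactly once
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();

        for (char c : s.toCharArray())
            if (mustEscape(c)) sb.append('%').append(Integer.toHexString(c));
            else               sb.append(c);

        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a b&c"));
        System.out.println(escape("100%"));
        System.out.println(escape("{x}"));
    }
}
```

Integer.toHexString produces lower-case digits, so '{' becomes "%7b" rather than "%7B". Both spellings denote the same byte in a percent-escape.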
-
toProperURLV3
public static java.lang.String toProperURLV3(java.lang.String url)
This leaves out the actual domain name before starting HTTP-URL Escape Sequences. If this starts with the words "http://domain.something/" then the initial colon, forward-slash and periods won't be escaped. Everything after the first front-slash will include URL-HTTP Escape characters.
This does the same thing as toProperURLV2(String), but skips the initial part of the URL text/string - IF PRESENT! http(s?)://domain.something/ is skipped by the Regular Expression, everything else from URLV2 is escaped.
- Parameters:
url
- This may be any internet URL, represented as a String. It will be escaped with the %INT format.
- Returns:
- An escaped URL String
- See Also:
toProperURLV2(String), P1
- Code:
- Exact Method Body:
String beginsWith = null;
Matcher m = P1.matcher(url);

if (m.find())
{
    beginsWith = m.group(1);
    url = url.substring(beginsWith.length());
}

return ((beginsWith != null) ? beginsWith : "") + toProperURLV2(url);
-
toProperURLV4
public static java.lang.String toProperURLV4(java.lang.String url)
This does the same thing as V3, but it also will avoid escaping any '?' (question-mark) or '&' (ampersand) or '/' (forward-slash) symbols anywhere in the entire String. It also "skips" escaping the initial HTTP(s)://domain.net.something/ as well - just like toProperURLV3
- Returns:
- This does the same thing as toProperURLV3(String), but leaves out 100% of the instances of Ampersand, Question-Mark, and Forward-Slash symbols.
- See Also:
toProperURLV3(String), P1, URL_ESC_CHARS_ABBREV, StrReplace.r(String, char[], IntCharFunction)
- Code:
- Exact Method Body:
String beginsWith = null;
Matcher m = P1.matcher(url);

if (m.find())
{
    beginsWith = m.group(1);
    url = url.substring(beginsWith.length());
}

return ((beginsWith != null) ? beginsWith : "") +
    StrReplace.r
        (url, URL_ESC_CHARS_ABBREV, (int i, char c) -> '%' + Integer.toHexString((int) c));
-
toProperURLV5
public static java.lang.String toProperURLV5(java.lang.String url)
This is probably the "smartest" URL Encoder. The Java URL-Encoder doesn't do any good! It literally encodes the forward-slashes inside the "HTTP://" string! That is a major mistake. Understanding how URL encoding works basically requires downloading web-pages.
NOTE 1: DNS does not really allow non-ASCII characters to be included inside of a domain-name. Doing any character-escaping inside of the host-part of a URL is not necessary, and if a programmer is trying to escape characters inside the "host" of a URL, he must not have tested the URL, because it is unlikely to be valid. Perhaps this differs in other parts of the world, if DNS is used differently there.
NOTE 2: Escaping characters in the directory or file part of a URL is generally a good idea, but there are many web-servers that are capable of dealing with foreign-language and UTF-8 characters. In fact, for most of the URL's that were used during the development of this package - which includes many links to the Chinese Government "Web Portal" on the Internet - no URL-Encoding or "URL-Escaping-Characters" were required, and all of them were loaded with UTF-8 (non-ASCII) Mandarin Chinese Characters. However, there are many web-servers that do not like non-ASCII characters inside the File/Path that comes after the domain. The "Wiki-Art" project web-server, for instance, expects that any accented European French or Spanish Vowels (of which almost all European languages contain quite a few - even German) are all URL-Encoded ("Escaped") using the UTF-8 Escape-Sequences.
NOTE 3: Most importantly, the way any web-server handles the query-strings might also be different than the way it handles the file and path strings. Generally, there is no guaranteed successful way to deal with URL-encoding, since there are many different types of web-servers on the internet. Moreover, how things are handled overseas in more developed countries of Asia makes knowing what is going on even more difficult.
FINAL NOTE: This version of URL-encoding encodes only one portion of a URL, and that is the file and directory portion. If there is a query-string included in this url, it won't be removed, but it will be ignored, and left unchanged. If there is a 'ref' portion of this URL, it will also be ignored, and left unchanged. Again, only the file & directory name of the URL shall be encoded using the "% %" (percent-percent) URL encoding scheme.
MOSTLY: The earlier versions of URL encoding experiments are being left in this package, even though they are probably not too useful, because mostly they are harmless, and probably explain a little bit about the "developer progression" while coding this package.
- Parameters:
url
- This is the URL to be encoded, properly
- Returns:
- A properly encoded URL String. Important: if calling the java.net.URL constructor generates a MalformedURLException, then this method shall return null. The java.net.URL constructor will be called if the String passed begins with the characters 'http://' or 'https://'.
- Code:
- Exact Method Body:
url = url.trim();

URL         u       = null;
String[]    sArr    = null;
String      tlc     = url.toLowerCase();

if (tlc.startsWith("http://") || tlc.startsWith("https://"))
{
    try
        { u = new URL(url); }
    catch (Exception e)
        { return null; }
}

if (u == null)  sArr = url.split("/");
else            sArr = u.getPath().split("/");

String          slash   = "";
StringBuilder   sb      = new StringBuilder();

for (String s : sArr)
{
    try
        { sb.append(slash + java.net.URLEncoder.encode(s, "UTF-8")); }
    catch (UnsupportedEncodingException e)
        { /* This really cannot happen, and I don't know what to put here! */ }

    slash = "/";
}

if (u == null)
    return sb.toString();
else
    return
        u.getProtocol() + "://" + u.getHost() + sb.toString() +
        ((u.getQuery() != null) ? ("?" + u.getQuery())  : "") +
        ((u.getRef() != null)   ? ("#" + u.getRef())    : "");
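The central step of the body above is to split the path on '/' and pass each segment through java.net.URLEncoder. The following is a small sketch of just that step. The class PathEncodeSketch and its method name are hypothetical, and the sample path is invented for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class PathEncodeSketch {
    // Encodes every '/'-separated segment of a path, leaving the slashes themselves alone
    static String encodePath(String path) {
        StringBuilder sb    = new StringBuilder();
        String        slash = "";

        for (String segment : path.split("/")) {
            try { sb.append(slash).append(URLEncoder.encode(segment, "UTF-8")); }
            catch (UnsupportedEncodingException e) { throw new IllegalStateException(e); }
            slash = "/";
        }

        return sb.toString();
    }

    public static void main(String[] args) {
        // \u00f1 is n-tilde
        System.out.println(encodePath("/news/a\u00f1o nuevo.html"));
        System.out.println(encodePath("/news/"));
    }
}
```

Two properties of this approach are worth knowing. URLEncoder performs HTML-form encoding, so a space becomes '+' rather than "%20". And String.split discards trailing empty strings, so a path ending in '/' comes back without its final slash.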
-
toProperURLV6
public static java.lang.String toProperURLV6(java.lang.String url)
Rather than trying to explain what is escaped and what is left alone, please review the exact code here.
IMPORTANT NOTE: On a close inspection and analysis of this code, one ought to realize that in the previous five versions of URL-Encoding, "experimentation" was going on. This last and final version of URL-Encoding is actually pretty successful. It handles all "extra characters" and is capable of dealing with URL's that contain the '?' '=' '&' operators of GET requests.
LEGACY NOTE: The previous five URL encoders are not going to be erased - as they leave the "learning trail" of what is going on with encoding URL's. One ought to realize that in the out-of-the-box (out-of-the-download) JDK, there is a class called "URI Encoder" - however, this class expects that the URL has already been separated out into its distinct parts. This method, indeed, does do the separating out of the URL's disparate parts before performing the character-escaping.
- Parameters:
url
- This is any java URL.
- Returns:
- a new String version of the input parameter 'url', or null if the java.net.URL constructor throws an exception for it.
- Code:
- Exact Method Body:
URL u = null;

try
    { u = new URL(url); }
catch (Exception e)
    { return null; }

StringBuilder sb = new StringBuilder();

sb.append(u.getProtocol());
sb.append("://");
sb.append(u.getHost());
sb.append(toProperURLV5(u.getPath()));

if (u.getQuery() != null)
{
    String[]        sArr        = u.getQuery().split("&");
    StringBuilder   sb2         = new StringBuilder();
    String          ampersand   = "";

    for (String s : sArr)
    {
        String[]        s2Arr   = s.split("=");
        StringBuilder   sb3     = new StringBuilder();
        String          equals  = "";

        for (String s2: s2Arr)
        {
            try
                { sb3.append(equals + java.net.URLEncoder.encode(s2, "UTF-8")); }

            // This should never happen - UTF-8 is (sort-of) the only encoding.
            catch (UnsupportedEncodingException e) { }

            equals = "=";
        }

        sb2.append(ampersand + sb3.toString());
        ampersand = "&";
    }

    sb.append("?" + sb2.toString());
}

// Not really a clue, because a the "#" operator and the "?" probably shouldn't be used together.
// Java's java.net.URL class will parse a URL that has both the ? and the #, but I have no idea
// which web-sites would allow this, or encourage this...

if (u.getRef() != null)
    try
        { sb.append("#" + java.net.URLEncoder.encode(u.getRef(), "UTF-8")); }
    catch (UnsupportedEncodingException e) { }

return sb.toString();
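The query-string handling in the body above can be read as: split on '&', split each piece on '=', encode every part, then re-join. A minimal sketch of that portion alone follows. The class QueryEncodeSketch and the sample query are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class QueryEncodeSketch {
    // Encodes the keys and values of a query-string, keeping the '&' and '=' separators
    static String encodeQuery(String query) {
        StringBuilder out       = new StringBuilder();
        String        ampersand = "";

        for (String pair : query.split("&")) {
            out.append(ampersand);
            String equals = "";

            for (String part : pair.split("=")) {
                try { out.append(equals).append(URLEncoder.encode(part, "UTF-8")); }
                catch (UnsupportedEncodingException e) { throw new IllegalStateException(e); }
                equals = "=";
            }

            ampersand = "&";
        }

        return out.toString();
    }

    public static void main(String[] args) {
        // \u00e9 is e-acute
        System.out.println(encodeQuery("q=some value&name=Jos\u00e9"));
    }
}
```

Because String.split drops trailing empty strings, a pair with an empty value such as "key=" is emitted as "key" by this approach.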
-
toProperURLV7
public static java.lang.String toProperURLV7(java.lang.String url) throws java.net.URISyntaxException, java.net.MalformedURLException
This method and toProperURLV8(URL) strictly use Java's URI Encoding Mechanism. They seem to work the same as "V6". Internally, these are the versions now used, as of November, 2019.
- Parameters:
url
- A Complete Java URL, as a String. Any special characters that need to be escaped, shall be.
- Throws:
java.net.URISyntaxException
- This will throw if building the URI generates an exception. Internally, all this method does is build a URI, and then call the Java Method 'toASCIIString()'
java.net.MalformedURLException
- Code:
- Exact Method Body:
return toProperURLV8(new URL(url));
-
toProperURLV8
public static java.lang.String toProperURLV8(java.net.URL url) throws java.net.URISyntaxException, java.net.MalformedURLException
This method and toProperURLV7(String) strictly use Java's URI Encoding Mechanism. They seem to work the same as "V6". Internally, these are the versions now used, as of November, 2019.
- Parameters:
url
- A Complete Java URL. Any special characters that need to be escaped, shall be.
- Throws:
java.net.URISyntaxException
- This will throw if building the URI generates an exception. Internally, all this method does is build a URI, and then call the Java Method 'toASCIIString()'
java.net.MalformedURLException
- Code:
- Exact Method Body:
return new URI( url.getProtocol(), url.getUserInfo(), url.getHost(), url.getPort(), url.getPath(), url.getQuery(), url.getRef() ).toASCIIString();
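A short sketch of what the seven-argument java.net.URI constructor followed by toASCIIString() does to a URL containing a space and an accented character. The class UriEncodeSketch, its String-based wrapper and the example.com address are hypothetical.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

public class UriEncodeSketch {
    // Same call sequence as the method body listed above
    static String encode(URL url) throws URISyntaxException {
        return new URI(
            url.getProtocol(), url.getUserInfo(), url.getHost(), url.getPort(),
            url.getPath(), url.getQuery(), url.getRef()
        ).toASCIIString();
    }

    // Hypothetical convenience wrapper that converts checked exceptions
    static String encode(String spec) {
        try { return encode(new URL(spec)); }
        catch (MalformedURLException | URISyntaxException e)
            { throw new IllegalArgumentException(e); }
    }

    public static void main(String[] args) {
        // \u00f1 is n-tilde
        System.out.println(encode("https://example.com/a\u00f1o nuevo.html?q=a b"));
    }
}
```

Unlike java.net.URLEncoder, this route writes a space as "%20", and it leaves the '/', '?', '=' and '&' separators in place because each component is quoted separately.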
-
removeDuplicates
public static int removeDuplicates(java.util.Vector<java.net.URL> urls)
If you have a list of URL's, and want to quickly remove any duplicate URL's found in the list, this will remove them.
NOTE: This will perform a few "to-lower-case" operations on the protocol and web-domain, but not perform "to-lower-case" on the file, directory, or query-string part of the URL.
SPECIFICALLY:
- These are considered duplicate URL's:
http://some.company.com/index.html
HTTP://SOME.COMPANY.COM/index.html
- These are not considered duplicate URL's:
http://other.company.com/Directory/Ben-Bitdiddle.html
http://other.company.com/DIRECTORY/BE.html
- Parameters:
urls
- Any list of URL's, some of which might have been duplicated. The difference between this 'removeDuplicates' and the other 'removeDuplicates' available in this class is that this one only removes multiple instances of the same URL in this Vector, while the other one iterates through a list of URL's already visited in a previous-session.
NOTE: Null Vector-values are skipped outright, they are neither removed nor changed.
- Returns:
- The number of Vector elements that were removed. (i.e. The size by which the Vector was shrunk.)
- Code:
- Exact Method Body:
TreeSet<String> dups = new TreeSet<>();

int count   = 0;
int size    = urls.size();
URL url     = null;

for (int i=0; i < size; i++)
    if ((url = urls.elementAt(i)) != null)
        if (! dups.add(urlToString(url)))
        {
            count++;
            size--;
            i--;
            urls.removeElementAt(i);
        }

return count;
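One detail of the listed body deserves care: the index i is decremented before removeElementAt(i) is called, which appears to remove the element just before the duplicate instead of the duplicate itself. The sketch below shows one way the same idea could be written so that the duplicate is the element removed. It is not the package's implementation. The class DedupSketch and its helpers are hypothetical, and the key function only approximates urlToString.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;
import java.util.Vector;

public class DedupSketch {
    // Approximation of urlToString: lower-case protocol and host, everything else untouched
    static String key(URL u) {
        return u.getProtocol().toLowerCase() + "://" + u.getHost().toLowerCase()
            + u.getFile() + ((u.getRef() != null) ? ("#" + u.getRef()) : "");
    }

    // Removes later occurrences of a URL already seen; null elements are left in place
    static int removeDuplicates(Vector<URL> urls) {
        Set<String> seen    = new HashSet<>();
        int         removed = 0;

        for (int i = 0; i < urls.size(); ) {
            URL u = urls.elementAt(i);

            if ((u != null) && !seen.add(key(u))) {
                urls.removeElementAt(i);    // remove the duplicate itself, do not advance
                removed++;
            }
            else i++;
        }

        return removed;
    }

    // Hypothetical helper that converts the checked exception
    static URL url(String spec) {
        try { return new URL(spec); }
        catch (MalformedURLException e) { throw new IllegalArgumentException(e); }
    }

    public static void main(String[] args) {
        Vector<URL> v = new Vector<>();
        v.add(url("http://example.com/a"));
        v.add(url("http://example.com/b"));
        v.add(url("http://EXAMPLE.COM/a"));

        System.out.println(removeDuplicates(v) + " removed, remaining: " + v);
    }
}
```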
-
removeDuplicates
public static int removeDuplicates (java.util.Vector<java.net.URL> visitedURLs, java.util.Vector<java.net.URL> potentiallyNewURLs)
This simple method will remove any URL's from the input Vector parameter 'potentiallyNewURLs' which are also present-members of the input Vector parameter 'visitedURLs'. This may seem trivial, and it is, but it worries about things like String-case for you.
- Parameters:
visitedURLs
- This parameter is a list of URL's that have already "been visited."
potentiallyNewURLs
- This parameter is a list of URL's that are possibly "un-visited" - meaning whatever scrape, crawl or search being performed needs to know which URL's are listed in the previous parameter's contents. This may seem trivial, just use the java url1.equals(url2) command, but, alas, java doesn't exactly take into account upper-case and lower-case domain-names. This worries about case.
- Returns:
- The number of URL's that were removed from the input Vector parameter 'potentiallyNewURLs'.
- Code:
- Exact Method Body:
// The easiest way to check for duplicates is to build a tree-set of all the URL's as a String.
// Java's TreeSet<> generic already (automatically) scans for duplicates (efficiently) and will tell
// you if you have tried to add a duplicate
TreeSet<String> dups = new TreeSet<>();

// Build a TreeSet of the url's from the "Visited URLs" parameter
visitedURLs.forEach(url -> dups.add(urlToString(url)));

// Add the "Possibly New URLs", one-by-one, and remove them if they are already in the visited list.
int count   = 0;
int size    = potentiallyNewURLs.size();
URL url     = null;

for (int i=0; i < size; i++)
    if ((url = potentiallyNewURLs.elementAt(i)) != null)
        if (! dups.add(urlToString(url)))
        {
            count++;
            size--;
            i--;
            potentiallyNewURLs.removeElementAt(i);
        }

return count;
-
shortenPoundREF
public static java.net.URL shortenPoundREF(java.net.URL url)
Removes any partial-reference '#' symbols from a URL. If this URL contains a pound-sign partial reference according to the Standard JDK's URL.getRef() method, and creating a new URL without this reference generates an exception, then this method shall return null.
- Parameters:
url
- Any standard HTTP URL. If this 'url' contains a '#' (Pound Sign, Partial Reference) - according to the standard JDK URL.getRef() method, then it shall be removed.
- Returns:
- The URL without the partial-reference, or the original URL if there was no partial reference. Null is returned if there is an error instantiating the new URL without the partial-reference.
- Code:
- Exact Method Body:
try
{
    if (url.getRef() != null)
        return new URL(
            ((url.getProtocol() != null) ? url.getProtocol().toLowerCase() : "") +
            "://" +
            ((url.getHost() != null) ? url.getHost().toLowerCase() : "") +
            ((url.getFile() != null) ? url.getFile() : "")
        );
    else
        return url;
}
catch (MalformedURLException e)
    { return null; }
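For comparison, the fragment can also be dropped with the four-argument java.net.URL constructor, which rebuilds the URL from its protocol, host, port and file parts without concatenating a String. This is an alternative sketch, not the method used by this class. The names StripRefSketch and stripRef are hypothetical, and unlike the listed body this version keeps an explicit port and does not lower-case the host.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class StripRefSketch {
    // Returns the same URL when there is no '#ref', otherwise a copy without it.
    // URL.getFile() holds the path plus the query-string, but never the ref.
    static URL stripRef(URL url) throws MalformedURLException {
        if (url.getRef() == null) return url;
        return new URL(url.getProtocol(), url.getHost(), url.getPort(), url.getFile());
    }

    // Hypothetical String-based wrapper that converts the checked exception
    static String stripRef(String spec) {
        try { return stripRef(new URL(spec)).toString(); }
        catch (MalformedURLException e) { throw new IllegalArgumentException(e); }
    }

    public static void main(String[] args) {
        System.out.println(stripRef("https://example.com/a.html?x=1#sec"));
        System.out.println(stripRef("https://example.com/a.html"));
    }
}
```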
-
shortenPoundREFs
public static int shortenPoundREFs(java.util.Vector<java.net.URL> urls, boolean ifExceptionSetNull)
This may seem like a bad thing to do - it removes all "#Partial-Page-Reference" elements from all URL's in a list. Generally, one might find such links useful, however, when performing a news-or-content web-site scrape, partial-page-links (i.e. links such as: <A HREF="ThisPage.html#mySubSection8">) are much more easily dealt with by removing the "hash-tag '#'" partial-reference, and returning the completed URL 'ThisPage.html' without it. Primarily when scanning for duplicates and trying to avoid the same web-page over and over again, this 'way-of-doing-things' is useful.
THINK: A partial-page URL will download the exact-same content to the getPageTokens(...) method either way. The hash-tag ('#') really only affects how a browser renders the page you are seeing, not the content of the URL.
- Parameters:
urls
- Any list of completed (read: fully-resolved) URL's.
ifExceptionSetNull
- If this is TRUE then if there is ever an exception building a new URL without a "Relative URL #" (Pound-Sign), then that position in the Vector will be replaced with 'null.'
- Returns:
- The number / count of URL's in this list that were modified. If a URL was modified, it was because it had a partial-page reference in it.
NOTE: If in the process of generating a new URL out of an old one, a MalformedURLException occurs, that element in the Vector will just be skipped (or set to null when 'ifExceptionSetNull' is TRUE), and no warning message provided.
- Code:
- Exact Method Body:
int pos             = 0;
int shortenCount    = 0;

for (int i = (urls.size() - 1); i >= 0; i--)
{
    URL url = urls.elementAt(i);

    try
    {
        if (url.getRef() != null)
        {
            URL newURL = new URL(
                ((url.getProtocol() != null) ? url.getProtocol().toLowerCase() : "") +
                "://" +
                ((url.getHost() != null) ? url.getHost().toLowerCase() : "") +
                ((url.getFile() != null) ? url.getFile() : "")
            );

            urls.setElementAt(newURL, i);
            shortenCount++;
        }
    }
    catch (MalformedURLException e)
    {
        if (ifExceptionSetNull) urls.setElementAt(null, i);
    }
}

return shortenCount;
-
shortenPoundREFs_KE
public static Ret2<java.lang.Integer,java.util.Vector<java.net.MalformedURLException>> shortenPoundREFs_KE (java.util.Vector<java.net.URL> urls, boolean ifExceptionSetNull)
This may seem like a bad thing to do - it removes all "#Partial-Page-Reference" elements from all URL's in a list. Generally, one might find such links useful, however, when performing a news-or-content web-site scrape, partial-page-links (i.e. links such as: <A HREF="ThisPage.html#mySubSection8">) are much more easily dealt with by removing the "hash-tag '#'" partial-reference, and returning the completed URL 'ThisPage.html' without it. Primarily when scanning for duplicates and trying to avoid the same web-page over and over again, this 'way-of-doing-things' is useful.
THINK: A partial-page URL will download the exact-same content to the getPageTokens(...) method either way. The hash-tag ('#') really only affects how a browser renders the page you are seeing, not the content of the URL.
NOTE: This method does the exact same thing as shortenPoundREFs(Vector, boolean), but if there are any exceptions while building the URL list without the '#' (pound-signs), those exceptions will be saved and stored in a returned Vector. This can be useful when working with large numbers of URL's, only a few of which cannot be resolved.
'KE' - Keep Exceptions: If this method generates a 'MalformedURLException' it will be returned along with the result (not thrown).
- Parameters:
urls
- Any list of completed (read: fully-resolved) URL's.
ifExceptionSetNull
- If this is TRUE then if there is ever an exception building a new URL without a "Relative URL '#'" (Pound-Sign), then that position in the Vector will be replaced with 'null.'
- Returns:
- The number / count of URL's in this list that were modified. If a URL was modified, it was because it had a partial-page reference in it. If in the process of generating a new URL out of an old one, a MalformedURLException occurs, the exception will be placed in the Ret2.b position, which is a Vector<MalformedURLException>.
SPECIFICALLY:
Ret2.a = 'Integer' number of URL's shortened for having a '#' partial-reference.
Ret2.b = Vector<MalformedURLException> where each element of this Vector is null if there were no problems converting the URL, or the exception if there was one.
- Code:
- Exact Method Body:
int                             pos             = 0;
int                             shortenCount    = 0;
Vector<MalformedURLException>   v               = new Vector<>();

for (int i=0; i < urls.size(); i++) v.setElementAt(null, i);

for (int i = (urls.size() - 1); i >= 0; i--)
{
    URL url = urls.elementAt(i);

    try
    {
        if (url.getRef() != null)
        {
            URL newURL = new URL(
                ((url.getProtocol() != null) ? url.getProtocol().toLowerCase() : "") +
                "://" +
                ((url.getHost() != null) ? url.getHost().toLowerCase() : "") +
                ((url.getFile() != null) ? url.getFile() : "")
            );

            urls.setElementAt(newURL, i);
            shortenCount++;
        }
    }
    catch (MalformedURLException e)
    {
        if (ifExceptionSetNull) urls.setElementAt(null, i);
        v.setElementAt(e, i);
    }
}

return new Ret2<Integer, Vector<MalformedURLException>>(Integer.valueOf(shortenCount), v);
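The listed body fills the exception Vector with setElementAt(null, i) while that Vector is still empty. In the standard java.util.Vector, setElementAt throws ArrayIndexOutOfBoundsException whenever the index is not smaller than the current size, so a non-empty input list would appear to fail at that loop. The sketch below shows one way a null-filled Vector of a given length could be prepared instead, using Vector.setSize. The class PresizeSketch and its method names are hypothetical.

```java
import java.net.MalformedURLException;
import java.util.Vector;

public class PresizeSketch {
    // Returns a Vector holding n null elements, so that setElementAt(e, i) is legal for 0 <= i < n
    static Vector<MalformedURLException> nullFilled(int n) {
        Vector<MalformedURLException> v = new Vector<>();
        v.setSize(n);   // grows the Vector, padding it with nulls
        return v;
    }

    // Demonstrates that setElementAt on an empty Vector is rejected
    static boolean setOnEmptyVectorThrows() {
        try {
            new Vector<MalformedURLException>().setElementAt(null, 0);
            return false;
        }
        catch (ArrayIndexOutOfBoundsException e) { return true; }
    }

    public static void main(String[] args) {
        Vector<MalformedURLException> v = nullFilled(3);
        v.setElementAt(new MalformedURLException("example"), 1);

        System.out.println(v);
        System.out.println("setElementAt on an empty Vector throws: " + setOnEmptyVectorThrows());
    }
}
```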
-
urlToString
public static java.lang.String urlToString(java.net.URL url)
This is a method that seems "buried", and is somewhat important. On the internet, a URL is part case-sensitive, and part case-insensitive. The domain-name and protocol (http://, and 'some.company.com') portions of the URL may be lower or upper case, and the powers-that-be on the internet will not know the difference.
HOWEVER: The directory, file-name, and (possible) query-string portion of a URL are very case-sensitive to the individual web-servers retrieving the HTTP / HTML / JSON / Whatever data that they intend to serve. Perhaps this method should have its own class, but alas, it does not.
- Parameters:
url
- This may be any Internet-Domain URL
- Returns:
- A String version of this URL, but the domain and protocol portions of the URL will be a "consistent" lower case. The case of the directory, file and (possibly, but not guaranteed to be present) query-string portion will not be modified either way.
NOTE: This type of information is pretty important if you are attempting to scan for duplicate URL's or check their equality.
- Code:
- Exact Method Body:
return
    ((url.getProtocol() != null) ? url.getProtocol().toLowerCase() : "") +
    "://" +
    ((url.getHost() != null) ? url.getHost().toLowerCase() : "") +
    ((url.getPath() != null) ? url.getPath() : "") +
    ((url.getQuery() != null) ? ('?' + url.getQuery()) : "") +
    ((url.getRef() != null) ? ('#' + url.getRef()) : "");
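A small sketch of how such a normalized String can be used to compare two URL's that differ only in the case of the host. The class NormalizeSketch and its String-based helper are hypothetical. The body mirrors the expression listed above and, like it, leaves out the port and user-info.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class NormalizeSketch {
    // Mirrors the expression listed above: lower-case protocol and host, rest unchanged
    static String normalize(URL url) {
        return
            ((url.getProtocol() != null) ? url.getProtocol().toLowerCase() : "") +
            "://" +
            ((url.getHost() != null) ? url.getHost().toLowerCase() : "") +
            ((url.getPath() != null) ? url.getPath() : "") +
            ((url.getQuery() != null) ? ('?' + url.getQuery()) : "") +
            ((url.getRef() != null) ? ('#' + url.getRef()) : "");
    }

    // Hypothetical String-based wrapper that converts the checked exception
    static String normalize(String spec) {
        try { return normalize(new URL(spec)); }
        catch (MalformedURLException e) { throw new IllegalArgumentException(e); }
    }

    public static void main(String[] args) {
        String a = normalize("http://some.company.com/index.html");
        String b = normalize("http://SOME.COMPANY.COM/index.html");
        String c = normalize("http://some.company.com/INDEX.html");

        System.out.println(a.equals(b));    // host case is ignored
        System.out.println(a.equals(c));    // path case is preserved
    }
}
```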
-
-