Package Torello.Java.Additional
Class URLs
- java.lang.Object
-
- Torello.Java.Additional.URLs
-
public class URLs extends java.lang.Object
A class that plays with URL's, no more, no less.
This provides a few utility functions for dealing with URL's.
This class does not perform relative / absolute URL resolution. URL resolution (completing a partial-URL using the complete-URL of the page on which the link sits) can be accomplished using the class Links, which may be found in the 'Torello.HTML' package.
This class helps analyze, just a tad, the escaping of certain characters found inside a Uniform Resource Locator so that it may connect to a Web-Server, or any HTTP / AJAX-Server.
Modern Existentialism:
This is an "existential" or "experimental" collection of attempts; it is not intended to be a serious thing.
Hi-Lited Source-Code:
- View Here: Torello/Java/Additional/URLs.java
- Open New Browser-Tab: Torello/Java/Additional/URLs.java
File Size: 34,266 Bytes, Line Count: 843 '\n' Characters Found
Stateless Class:
This class neither contains any program-state, nor can it be instantiated. The @StaticFunctional Annotation may also be called 'The Spaghetti Report'. Static-Functional classes are, essentially, C-Styled Files, without any constructors or non-static member fields. It is a concept very similar to the Java-Bean's @Stateless Annotation.
- 1 Constructor(s), 1 declared private, zero-argument constructor
- 16 Method(s), 16 declared static
- 6 Field(s), 6 declared static, 6 declared final
-
-
Field Summary

Fields
Modifier and Type             Field
protected static Pattern      P1
protected static String       RE1
protected static char[]       URL_ESC_CHARS
protected static char[]       URL_ESC_CHARS_ABBREV
protected static char[]       VOWELS
protected static String[]     VOWELS_URL
-
Method Summary

Fun with URL Escape
Modifier and Type                                      Method
static String                                          toProperURLV1(String url)
static String                                          toProperURLV2(String urlStuff)
static String                                          toProperURLV3(String url)
static String                                          toProperURLV4(String url)
static String                                          toProperURLV5(String url)
static String                                          toProperURLV6(String url)
static String                                          toProperURLV7(String url)
static String                                          toProperURLV8(URL url)

Remove Anchor-Name / Relative / Fragment URLs
Modifier and Type                                      Method
static URL                                             shortenPoundREF(URL url)
static int                                             shortenPoundREFs(Vector<URL> urls, boolean ifExceptionSetNull)
static Ret2<Integer,Vector<MalformedURLException>>     shortenPoundREFs_KE(Vector<URL> urls, boolean ifExceptionSetNull)

Remove Duplicate URL's from a List
Modifier and Type                                      Method
static int                                             removeDuplicates(Vector<URL> urls)
static int                                             removeDuplicates(Vector<URL> visitedURLs, Vector<URL> potentiallyNewURLs)

Other Methods
Modifier and Type                                      Method
static void                                            CURL(URL url, String outFileName, String userAgent)
protected static void                                  javaURLHelpMessage(StorageWriter sw)
static String                                          urlToString(URL url)
-
-
-
Field Detail
-
RE1
protected static final java.lang.String RE1
This is a Regular-Expression Pattern (java.util.regex.Pattern) - saved as a String. It is subsequently compiled.
The primary function is to match String's that are intended to be HTTP-URL's. This Regular Expression matches:
http(s)://...<any-text>.../
http(s)://...<any-text, not forward-slash>...
http(s)://...<any-text>.../...<any-text, not forward-slash>...
- See Also: P1, Constant Field Values
- Code:
- Exact Field Declaration Expression:
protected static final String RE1 = "^(http[s]?:\\/\\/.*?\\/$|http[s]?:\\/\\/[^\\/]*$|http[s]?:\\/\\/.*?\\/[^\\/]+)";
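- Example (Hypothetical Regex Sketch):
A minimal, self-contained sketch showing how a Pattern compiled from RE1 behaves against a few sample URL-Strings. The sample URL's are illustrative only, and the printed prefix is simply whatever the first matching alternative captures.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RE1Demo
{
    // Copied from the field declaration above
    static final String RE1 =
        "^(http[s]?:\\/\\/.*?\\/$|http[s]?:\\/\\/[^\\/]*$|http[s]?:\\/\\/.*?\\/[^\\/]+)";

    public static void main(String[] argv)
    {
        Pattern p = Pattern.compile(RE1);

        String[] tests =
        {
            "https://dallascityhall.com/",              // host followed by a trailing slash
            "http://dallascityhall.com",                // host only, no slash at all
            "https://dallascityhall.com/news/a.html"    // host, directory and file
        };

        for (String s : tests)
        {
            Matcher m = p.matcher(s);
            System.out.println(s + "  =>  " + (m.find() ? m.group(1) : "(no match)"));
        }
    }
}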
-
P1
-
VOWELS
protected static final char[] VOWELS
When scraping Spanish URL's, these characters can / should be escaped.
Parallel Array Note:
This array shall be considered parallel to the replacement String[]-Array VOWELS_URL.
- See Also: toProperURLV1(String), VOWELS_URL
- Code:
- Exact Field Declaration Expression:
protected static final char[] VOWELS = { 'á', 'É', 'é', 'Í', 'í', 'Ó', 'ó', 'Ú', 'ú', 'Ü', 'ü', 'Ñ', 'ñ', 'Ý', 'ý', '¿', '¡' };
-
VOWELS_URL
protected static final java.lang.String[] VOWELS_URL
When scraping Spanish URL's, these String's are the URL Escape Sequences for the Spanish Vowel Characters listed in VOWELS.
Parallel Array Note:
This array shall be considered parallel to the char[]-Array VOWELS.
- See Also: toProperURLV1(String), VOWELS
- Code:
- Exact Field Declaration Expression:
protected static final String[] VOWELS_URL = { "%C3%A1", "%C3%89", "%C3%A9", "%C3%8D", "%C3%AD", "%C3%93", "%C3%B3", "%C3%9A", "%C3%BA", "%C3%9C", "%C3%BC", "%C3%91", "%C3%B1", "%C3%9D", "%C3%BD", "%C2%BF", "%C2%A1" };
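- Example (Hypothetical Parallel-Array Sketch):
A minimal sketch of the parallel-array substitution that VOWELS and VOWELS_URL imply. It does by hand what the documented call StrReplace.r(url, VOWELS, VOWELS_URL) is described as doing; the two arrays are copied from the field declarations above, and the sample URL is illustrative.

public class ParallelReplaceSketch
{
    static final char[]   VOWELS     = { 'á', 'É', 'é', 'Í', 'í', 'Ó', 'ó', 'Ú', 'ú',
                                          'Ü', 'ü', 'Ñ', 'ñ', 'Ý', 'ý', '¿', '¡' };

    static final String[] VOWELS_URL = { "%C3%A1", "%C3%89", "%C3%A9", "%C3%8D", "%C3%AD",
                                          "%C3%93", "%C3%B3", "%C3%9A", "%C3%BA", "%C3%9C",
                                          "%C3%BC", "%C3%91", "%C3%B1", "%C3%9D", "%C3%BD",
                                          "%C2%BF", "%C2%A1" };

    static String escapeSpanish(String url)
    {
        StringBuilder sb = new StringBuilder();

        outer:
        for (char c : url.toCharArray())
        {
            // If 'c' sits at position i in VOWELS, append the escape at position i instead
            for (int i = 0; i < VOWELS.length; i++)
                if (VOWELS[i] == c) { sb.append(VOWELS_URL[i]); continue outer; }

            sb.append(c);
        }

        return sb.toString();
    }

    public static void main(String[] argv)
    {
        // The 'ñ' should be replaced by its UTF-8 escape, "%C3%B1"
        System.out.println(escapeSpanish("https://es.wikipedia.org/wiki/España"));
    }
}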
-
URL_ESC_CHARS
protected static final char[] URL_ESC_CHARS
This list of java char's are characters that are better off escaped when passing them through a URL.
- See Also: toProperURLV2(String)
- Code:
- Exact Field Declaration Expression:
protected static final char[] URL_ESC_CHARS = { '%', ' ', '#', '$', '&', '@', '`', '/', ':', ';', '<', '=', '>', '?', '[', '\\', ']', '^', '{', '|', '}', '~', '\'', '+', ',' };
-
URL_ESC_CHARS_ABBREV
protected static final char[] URL_ESC_CHARS_ABBREV
This is a (shortened) list of characters that should be escaped before being used within a URL.
This version differs from URL_ESC_CHARS in that it does not include the '&' (ampersand), the '?' (question-mark) or the '/' (forward-slash).
- See Also: URL_ESC_CHARS, toProperURLV4(String)
- Code:
- Exact Field Declaration Expression:
protected static final char[] URL_ESC_CHARS_ABBREV = { '%', ' ', '#', '$', '@', '`', ':', ';', '<', '=', '>', '[', '\\', ']', '^', '{', '|', '}', '~', '\'', '+', ',' };
-
-
Method Detail
-
javaURLHelpMessage
protected static final void javaURLHelpMessage(StorageWriter sw)
Java Help Message explaining class java.net.URL - and the specific output of its methods. This will just print a 'friendly-reminder' to the terminal-console output showing what the actual output of the class java.net.URL actually is. This helps when breaking-up / resolving partial URL links and partial Image-URL links.
Generally, playing with URL's gets more interesting when any foreign language characters are involved. There is an EXTREMELY standardized way to encode characters in just about any language in the world (and the name of that "way" is UTF-8).
Dallas Government Filth:
The following output was generated when scraping the City of Dallas Web-Server, collecting E-Mail addresses for the E-Mail Distribution list regarding Human-Rights Abuses (Hypno-Programming) in this city. Programmers are not obligated to write their City-Council Man or their Congressman in order to use any of the material in this scrape package. However, if you are concerned about the abuses of power in the "former" United States, scraping government-websites to collect the e-mail addresses of the rapist-trash in power in Dallas is very easy using this package.

java.net.URL Method()    String-Result

u.toString()             https://DALLASCITYHALL.com
u.getProtocol()          https
u.getHost()              DALLASCITYHALL.com
u.getPath()
u.getFile()
u.getQuery()             null
u.getRef()               null
u.getAuthority()         DALLASCITYHALL.com
u.getUserInfo()          null
urlToString(u)           https://dallascityhall.com

u.toString()             https://dallascityhall.com/
u.getProtocol()          https
u.getHost()              dallascityhall.com
u.getPath()              /
u.getFile()              /
u.getQuery()             null
u.getRef()               null
u.getAuthority()         dallascityhall.com
u.getUserInfo()          null
urlToString(u)           https://dallascityhall.com/

u.toString()             https://dallascityhall.com/news
u.getProtocol()          https
u.getHost()              dallascityhall.com
u.getPath()              /news
u.getFile()              /news
u.getQuery()             null
u.getRef()               null
u.getAuthority()         dallascityhall.com
u.getUserInfo()          null
urlToString(u)           https://dallascityhall.com/news

u.toString()             https://dallascityhall.com/news/
u.getProtocol()          https
u.getHost()              dallascityhall.com
u.getPath()              /news/
u.getFile()              /news/
u.getQuery()             null
u.getRef()               null
u.getAuthority()         dallascityhall.com
u.getUserInfo()          null
urlToString(u)           https://dallascityhall.com/news/

u.toString()             http://DALLASCITYHALL.com/news/ARTICLE-1.html
u.getProtocol()          http
u.getHost()              DALLASCITYHALL.com
u.getPath()              /news/ARTICLE-1.html
u.getFile()              /news/ARTICLE-1.html
u.getQuery()             null
u.getRef()               null
u.getAuthority()         DALLASCITYHALL.com
u.getUserInfo()          null
urlToString(u)           http://dallascityhall.com/news/ARTICLE-1.html

u.toString()             https://DallasCityHall.com/NEWS/article1.html?q=somevalue
u.getProtocol()          https
u.getHost()              DallasCityHall.com
u.getPath()              /NEWS/article1.html
u.getFile()              /NEWS/article1.html?q=somevalue
u.getQuery()             q=somevalue
u.getRef()               null
u.getAuthority()         DallasCityHall.com
u.getUserInfo()          null
urlToString(u)           https://dallascityhall.com/NEWS/article1.html?q=somevalue

u.toString()             https://DallasCityHall.com/news/ARTICLE-1.html#subpart1
u.getProtocol()          https
u.getHost()              DallasCityHall.com
u.getPath()              /news/ARTICLE-1.html
u.getFile()              /news/ARTICLE-1.html
u.getQuery()             null
u.getRef()               subpart1
u.getAuthority()         DallasCityHall.com
u.getUserInfo()          null
urlToString(u)           https://dallascityhall.com/news/ARTICLE-1.html#subpart1

u.toString()             https://DallasCityHall.com/NEWS/article1.html?q=somevalue&q2=someOtherValue
u.getProtocol()          https
u.getHost()              DallasCityHall.com
u.getPath()              /NEWS/article1.html
u.getFile()              /NEWS/article1.html?q=somevalue&q2=someOtherValue
u.getQuery()             q=somevalue&q2=someOtherValue
u.getRef()               null
u.getAuthority()         DallasCityHall.com
u.getUserInfo()          null
urlToString(u)           https://dallascityhall.com/NEWS/article1.html?q=somevalue&q2=someOtherValue

u.toString()             https://DallasCityHall.com/NEWS/article1.html?q=somevalue&q2=someOtherValue#LocalRef
u.getProtocol()          https
u.getHost()              DallasCityHall.com
u.getPath()              /NEWS/article1.html
u.getFile()              /NEWS/article1.html?q=somevalue&q2=someOtherValue
u.getQuery()             q=somevalue&q2=someOtherValue
u.getRef()               LocalRef
u.getAuthority()         DallasCityHall.com
u.getUserInfo()          null
urlToString(u)           https://dallascityhall.com/NEWS/article1.html?q=somevalue&q2=someOtherValue#LocalRef

- Parameters:
sw
- An instance of class StorageWriter. This parameter may be null, and if it is, text-output will be sent to Standard-Output.
- Code:
- Exact Method Body:
if (sw == null) sw = new StorageWriter();

String[] urlStrArr =
{
    "https://DALLASCITYHALL.com",
    "https://dallascityhall.com/",
    "https://dallascityhall.com/news",
    "https://dallascityhall.com/news/",
    "http://DALLASCITYHALL.com/news/ARTICLE-1.html",
    "https://DallasCityHall.com/NEWS/article1.html?q=somevalue",
    "https://DallasCityHall.com/news/ARTICLE-1.html#subpart1",
    "https://DallasCityHall.com/NEWS/article1.html?q=somevalue&q2=someOtherValue",
    "https://DallasCityHall.com/NEWS/article1.html?q=somevalue&q2=someOtherValue#LocalRef"
};

URL[] urlArr = new URL[urlStrArr.length];

try
{
    for (int i=0; i < urlStrArr.length; i++) urlArr[i] = new URL(urlStrArr[i]);
}
catch (Exception e)
{
    sw.println(
        "Broke a URL, and it generated an exception.\n" +
        "Sorry, fix the URL's in this method.\n" +
        "Did you change them?"
    );
    e.printStackTrace();
    return;
}

for (URL u : urlArr)
{
    System.out.println(
        "u.toString():\t\t"   + BCYAN + u.toString() + RESET + '\n' +
        "u.getProtocol():\t"  + u.getProtocol()  + '\n' +
        "u.getHost():\t\t"    + u.getHost()      + '\n' +
        "u.getPath():\t\t"    + u.getPath()      + '\n' +
        "u.getFile():\t\t"    + u.getFile()      + '\n' +
        "u.getQuery():\t\t"   + u.getQuery()     + '\n' +
        "u.getRef():\t\t"     + u.getRef()       + '\n' +
        "u.getAuthority():\t" + u.getAuthority() + '\n' +
        "u.getUserInfo():\t"  + u.getUserInfo()  + '\n' +
        "urlToString(u):\t\t" + urlToString(u)
    );
}
-
toProperURLV1
public static java.lang.String toProperURLV1(java.lang.String url)
This will substitute many of the Spanish-characters that can make a web-query difficult. These are the substitutions listed:

Spanish Language Character    URL Escape Sequence
Á                             %C3%81
á                             %C3%A1
É                             %C3%89
é                             %C3%A9
Í                             %C3%8D
í                             %C3%AD
Ó                             %C3%93
ó                             %C3%B3
Ú                             %C3%9A
ú                             %C3%BA
Ü                             %C3%9C
ü                             %C3%BC
Ñ                             %C3%91
ñ                             %C3%B1
Ý                             %C3%9D
ý                             %C3%BD
Historical Note:
This method was written the very first time that a URL needed to be escaped during the writing of the Java-HTML '.jar'.
- Parameters:
url - Any website URL query.
- Returns: The same URL with substitutions made.
- See Also: VOWELS, VOWELS_URL, StrReplace.r(String, char[], String[])
- Code:
- Exact Method Body:
return StrReplace.r(url, VOWELS, VOWELS_URL);
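- Example (Hypothetical Usage Sketch):
A minimal sketch, assuming the 'URLs' class is imported from the Torello.Java.Additional package. The sample URL, and the comment about its expected escape, are illustrative and based on the substitution table above.

import Torello.Java.Additional.URLs;

public class V1Demo
{
    public static void main(String[] argv)
    {
        // 'España' contains 'ñ', which appears in the substitution table above
        String escaped = URLs.toProperURLV1("https://es.wikipedia.org/wiki/España");

        // Based on the table, the 'ñ' should come back as "%C3%B1"
        System.out.println(escaped);
    }
}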
-
toProperURLV2
public static java.lang.String toProperURLV2(java.lang.String urlStuff)
This method will clobber the leading Domain-Name and Protocol - the http://domain.name.something/ stuff. It is best to use this method on String's that will be inserted into a URL after the '?' question-mark, inside the Query-String.
This can be very useful when sending JSON Arguments, for instance, inside a URL's Query-String, instead of the GET / POST part of a request.
Note that this method should not be used to escape characters outside of the range of Standard-ASCII (characters 0 ... 255).
State of the Experiment:
It seems to help to escape these characters: # $ % & @ ` / : ; < = > ? [ \ ] ^ | ~ " ' + , { }
- Parameters:
urlStuff - Any information that is intended to be sent via an HTTP-URL, and needs to be escaped.
- Returns: An escaped version of this URL-String
- See Also: URL_ESC_CHARS, StrReplace.r(String, char[], IntCharFunction)
- Code:
- Exact Method Body:
return StrReplace.r( urlStuff, URL_ESC_CHARS, (int i, char c) -> '%' + Integer.toHexString((int) c) );
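- Example (Hypothetical Usage Sketch):
A minimal sketch, assuming the 'URLs' class is imported from the Torello.Java.Additional package; the JSON fragment used as input is illustrative only.

import Torello.Java.Additional.URLs;

public class V2Demo
{
    public static void main(String[] argv)
    {
        // A JSON fragment that is about to be placed after the '?' of a Query-String
        String json = "{\"name\": \"value\"}";

        // Every character listed in URL_ESC_CHARS should come back as '%' plus its
        // hex code-point; for instance '{' should become "%7b" and ' ' should become "%20"
        System.out.println(URLs.toProperURLV2(json));
    }
}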
-
toProperURLV3
public static java.lang.String toProperURLV3(java.lang.String url)
This leaves out the actual domain name before starting HTTP-URL Escape Sequences. If this starts with the words "http://domain.something/" then the initial colon, forward-slashes and periods won't be escaped. Everything after the first forward-slash will include URL-HTTP Escape characters.
This does the same thing as toProperURLV2(String), but skips the initial part of the URL text/string - IF PRESENT! http(s?)://domain.something/ is skipped by the Regular Expression; everything else from URLV2 is escaped.
- Parameters:
url - This may be any internet URL, represented as a String. It will be escaped with the %INT format.
- Returns: An escaped URL String
- See Also: toProperURLV2(String), P1
- Code:
- Exact Method Body:
String beginsWith = null;
Matcher m = P1.matcher(url);

if (m.find())
{
    beginsWith = m.group(1);
    url        = url.substring(beginsWith.length());
}

return ((beginsWith != null) ? beginsWith : "") + toProperURLV2(url);
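- Example (Hypothetical Usage Sketch):
A minimal sketch, assuming the 'URLs' class is imported from the Torello.Java.Additional package; the sample URL is illustrative only.

import Torello.Java.Additional.URLs;

public class V3Demo
{
    public static void main(String[] argv)
    {
        // The leading portion matched by the P1 Regular-Expression is kept verbatim;
        // the remainder is escaped exactly as toProperURLV2(String) would escape it,
        // so the space in the file-name should come back as "%20"
        String url = "https://dallascityhall.com/news/some article.html";

        System.out.println(URLs.toProperURLV3(url));
    }
}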
-
toProperURLV4
public static java.lang.String toProperURLV4(java.lang.String url)
This does the same thing as V3, but it also will avoid escaping any '?' (question-mark), '&' (ampersand) or '/' (forward-slash) symbols anywhere in the entire String. It also "skips" escaping the initial HTTP(s)://domain.net.something/ as well - just like toProperURLV3.
- Returns: This does the same thing as toProperURLV3(String), but leaves out 100% of the instances of Ampersand, Question-Mark, and Forward-Slash symbols.
- See Also: toProperURLV3(String), P1, URL_ESC_CHARS_ABBREV, StrReplace.r(String, char[], IntCharFunction)
- Code:
- Exact Method Body:
String beginsWith = null;
Matcher m = P1.matcher(url);

if (m.find())
{
    beginsWith = m.group(1);
    url        = url.substring(beginsWith.length());
}

return ((beginsWith != null) ? beginsWith : "") + StrReplace.r(
    url, URL_ESC_CHARS_ABBREV,
    (int i, char c) -> '%' + Integer.toHexString((int) c)
);
-
toProperURLV5
public static java.lang.String toProperURLV5(java.lang.String url)
This is probably the "smartest" URL Encoder. The Java URL-Encoder doesn't do any good! It literally encodes the forward-slashes inside the "HTTP://" string! That is a major mistake. Understanding how URL encoding works basically requires downloading many Web-Pages.
Simple ASCII:
DNS does not really allow non-ASCII characters to be included inside of a Domain-Name. Doing any character-escaping inside of the host-part of a URL is not necessary, and if a programmer is trying to escape characters inside the "host" of a URL, he must not have tested the URL, because it is not likely to be valid.
Perhaps this differs in other parts of the world, where American-DNS may not be in use.
UTF-8 Foreign Language Characters:
Escaping characters in the directory or file part of a URL is generally a good idea, but there are many web-servers that are capable of dealing with Foreign-Language and UTF-8 characters by themselves just fine. In fact, for most of the URL's sent to the Chinese Government Web-Portal, no URL-Encoding (Character-Escaping) was necessary at all. All of them contained Chinese Language Characters.
There are some Web-Servers that do not like non-ASCII Characters inside the File / Path that comes after the domain. The "Wiki-Art" project Web-Server, for instance, expects that any accented European French or Spanish Vowels all be URL-Encoded ("Escaped") using the UTF-8 Escape-Sequences.
URL Query Strings:
Most importantly, the way any Web-Server handles the Query-Strings might also be different than the way it handles the file and path String's. Generally, there is no guaranteed, consistent & successful way to deal with URL-encoding, since there are many different types of Web-Servers on the Internet.
Moreover, how things are handled overseas in the more developed countries of Asia makes knowing what is going on even more difficult.
After Domain-Name:
This URL-Encoding method only encodes the file and directory portion. The Domain-Name and 'HTTP' part are left alone. If there is a Query-String included in this URL, it will be left unchanged.
If there is a 'ref' portion of this URL, it will also be left unchanged.
Again, only the file & directory name of the URL shall be encoded with the '%' (percent) URL-encoding scheme.
Little Summary:
Mostly, the earlier versions of these URL-encoding experiments are being left in this package, even though they might qualify as 'useless.' I don't actually visit a lot of new Web-Sites, and therefore new URL's are hard to test and think about.
- Parameters:
url - This is the URL to be encoded, properly
- Returns: A properly encoded URL String. Important: if calling the java.net.URL constructor generates a MalformedURLException, then this method shall return null. The java.net.URL constructor will be called if the String passed begins with the characters 'http://' or 'https://'.
- Code:
- Exact Method Body:
url = url.trim();

URL      u    = null;
String[] sArr = null;
String   tlc  = url.toLowerCase();

if (tlc.startsWith("http://") || tlc.startsWith("https://"))
{
    try
        { u = new URL(url); }
    catch (Exception e)
        { return null; }
}

if (u == null)  sArr = url.split("/");
else            sArr = u.getPath().split("/");

String        slash = "";
StringBuilder sb    = new StringBuilder();

for (String s : sArr)
{
    try
        { sb.append(slash + java.net.URLEncoder.encode(s, "UTF-8")); }
    catch (UnsupportedEncodingException e)
        { /* This really cannot happen, and I don't know what to put here! */ }

    slash = "/";
}

if (u == null) return sb.toString();

else return
    u.getProtocol() + "://" + u.getHost() + sb.toString() +
    ((u.getQuery() != null) ? ("?" + u.getQuery()) : "") +
    ((u.getRef()   != null) ? ("#" + u.getRef())   : "");
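- Example (Hypothetical Usage Sketch):
A minimal sketch, assuming the 'URLs' class is imported from the Torello.Java.Additional package; the Wiki-Art style URL is illustrative only.

import Torello.Java.Additional.URLs;

public class V5Demo
{
    public static void main(String[] argv)
    {
        // Only the directory & file portion is encoded; the protocol, host,
        // query-string and '#' ref should be passed through unchanged, while the
        // accented 'e' in the path should become its UTF-8 escape, "%C3%A9"
        String url = "https://www.wikiart.org/en/paul-cézanne?tab=featured#top";

        System.out.println(URLs.toProperURLV5(url));
    }
}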
-
toProperURLV6
public static java.lang.String toProperURLV6(java.lang.String url)
Rather than trying to explain what is escaped and what is left alone, please review the exact code here.
Another One:
Well, I just wrote another one, they told me to. This newest version of URL-Encoding is actually pretty successful. It handles all Extra-Characters and is capable of dealing with URL's that contain the '?' '=' '&' operators of GET-Requests.
Realize that in the out-of-the-box JDK, there is a class called "URI Encoder" - but that class expects the URL to have already been separated out into its distinct parts.
This method does the URL-separating into disparate parts before performing the Character-Escaping.
- Parameters:
url - This is any java URL.
- Returns: a new String version of the input parameter 'url'
- Code:
- Exact Method Body:
URL u = null;

try
    { u = new URL(url); }
catch (Exception e)
    { return null; }

StringBuilder sb = new StringBuilder();

sb.append(u.getProtocol());
sb.append("://");
sb.append(u.getHost());
sb.append(toProperURLV5(u.getPath()));

if (u.getQuery() != null)
{
    String[]      sArr      = u.getQuery().split("&");
    StringBuilder sb2       = new StringBuilder();
    String        ampersand = "";

    for (String s : sArr)
    {
        String[]      s2Arr  = s.split("=");
        StringBuilder sb3    = new StringBuilder();
        String        equals = "";

        for (String s2: s2Arr)
        {
            try
                { sb3.append(equals + java.net.URLEncoder.encode(s2, "UTF-8")); }

            // This should never happen - UTF-8 is (sort-of) the only encoding.
            catch (UnsupportedEncodingException e) { }

            equals = "=";
        }

        sb2.append(ampersand + sb3.toString());
        ampersand = "&";
    }

    sb.append("?" + sb2.toString());
}

// Not really a clue, because the "#" operator and the "?" probably shouldn't be used
// together.  Java's java.net.URL class will parse a URL that has both the ? and the #, but
// I have no idea which Web-Sites would allow this, or encourage this...
if (u.getRef() != null)
    try
        { sb.append("#" + java.net.URLEncoder.encode(u.getRef(), "UTF-8")); }
    catch (UnsupportedEncodingException e) { }

return sb.toString();
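- Example (Hypothetical Usage Sketch):
A minimal sketch, assuming the 'URLs' class is imported from the Torello.Java.Additional package; the sample URL and its query values are illustrative only.

import Torello.Java.Additional.URLs;

public class V6Demo
{
    public static void main(String[] argv)
    {
        // The path and each name / value inside the Query-String should be escaped
        // individually, while the '?', '=' and '&' separators are preserved
        String url = "https://dallascityhall.com/búsqueda.html?q=josé garcía&lang=es";

        System.out.println(URLs.toProperURLV6(url));
    }
}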
-
toProperURLV7
public static java.lang.String toProperURLV7(java.lang.String url) throws java.net.URISyntaxException, java.net.MalformedURLException
These strictly use Java's URI Encoding Mechanism. They seem to work the same as "V6". Internally, these are now used - this as of November, 2019.
- Parameters:
url - A Complete Java URL, as a String. Any specialized Escape-Characters that need to be escaped, will be.
- Throws:
java.net.URISyntaxException - This will throw if building the URI generates an exception. Internally, all this method does is build a URI, and then call the Java Method 'toASCIIString()'
java.net.MalformedURLException
- Code:
- Exact Method Body:
return toProperURLV8(new URL(url));
-
toProperURLV8
public static java.lang.String toProperURLV8(java.net.URL url) throws java.net.URISyntaxException, java.net.MalformedURLException
These strictly use Java's URI Encoding Mechanism. They seem to work the same as "V6". Internally, these are now used - this as of November, 2019.
- Parameters:
url - A Complete Java URL. Any specialized Escape-Characters that need to be escaped, will be.
- Throws:
java.net.URISyntaxException - This will throw if building the URI generates an exception. Internally, all this method does is build a URI, and then call the Java Method 'toASCIIString()'
java.net.MalformedURLException
- Code:
- Exact Method Body:
return new URI(
    url.getProtocol(), url.getUserInfo(), url.getHost(), url.getPort(),
    url.getPath(), url.getQuery(), url.getRef()
).toASCIIString();
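- Example (JDK Mechanism Sketch):
A minimal sketch of the standard-JDK mechanism that the method body above relies on (the multi-argument java.net.URI constructor plus toASCIIString()); the sample URL is illustrative only.

import java.net.URI;
import java.net.URL;

public class V8Demo
{
    public static void main(String[] argv) throws Exception
    {
        URL url = new URL("https://dallascityhall.com/página.html?q=año nuevo#sección");

        // Same multi-argument java.net.URI constructor used in the method body above.
        // The constructor quotes illegal characters (such as the space), and
        // toASCIIString() percent-encodes any remaining non-ASCII characters.
        String ascii = new URI(
            url.getProtocol(), url.getUserInfo(), url.getHost(), url.getPort(),
            url.getPath(), url.getQuery(), url.getRef()
        ).toASCIIString();

        System.out.println(ascii);
    }
}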
-
removeDuplicates
public static int removeDuplicates(java.util.Vector<java.net.URL> urls)
If you have a list of URL's, and want to quickly remove any duplicate-URL's found in the list - this will remove them.
Case Sensitivity:
This method will perform a few "to-lower-case" operations on the protocol and Web-Domain parts, but not on the file, directory, or Query-String portion of the URL.
This should hilite what is Case-Sensitive, and what is not:
- These are considered duplicate URL's:
http://some.company.com/index.html
HTTP://SOME.COMPANY.COM/index.html
- These are not considered duplicate URL's:
http://other.company.com/Directory/Ben-Bitdiddle.html
http://other.company.com/DIRECTORY/BE.html
- Parameters:
urls - Any list of URL's, some of which might have been duplicated. The difference between this 'removeDuplicates' and the other 'removeDuplicates' available in this class is that this one only removes multiple instances of the same URL in this Vector, while the other one iterates through a list of URL's already visited in a previous-session.
NOTE: Null Vector-values are skipped outright; they are neither removed nor changed.
- Returns: The number of Vector elements that were removed. (i.e. The size by which the Vector was shrunk.)
- Code:
- Exact Method Body:
TreeSet<String> dups = new TreeSet<>();

int count = 0;
int size  = urls.size();
URL url   = null;

for (int i=0; i < size; i++)

    if ((url = urls.elementAt(i)) != null)

        if (! dups.add(urlToString(url)))
        { count++; size--; i--; urls.removeElementAt(i); }

return count;
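- Example (Hypothetical Usage Sketch):
A minimal sketch, assuming the 'URLs' class is imported from the Torello.Java.Additional package; the two sample URL's are taken from the case-sensitivity discussion above.

import Torello.Java.Additional.URLs;
import java.net.URL;
import java.util.Vector;

public class RemoveDupsDemo
{
    public static void main(String[] argv) throws Exception
    {
        Vector<URL> urls = new Vector<>();

        // These two differ only in protocol / domain case, so - per the description
        // above - they should be treated as duplicates of one another
        urls.add(new URL("http://some.company.com/index.html"));
        urls.add(new URL("HTTP://SOME.COMPANY.COM/index.html"));

        int removed = URLs.removeDuplicates(urls);

        System.out.println("removed=" + removed + ", remaining=" + urls.size());
    }
}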
-
removeDuplicates
public static int removeDuplicates (java.util.Vector<java.net.URL> visitedURLs, java.util.Vector<java.net.URL> potentiallyNewURLs)
This simple method will remove any URL's from the input Vector parameter 'potentiallyNewURLs' which are also present-members of the input Vector parameter 'visitedURLs'.
This may seem trivial, and it is, but it worries about things like the String's Case for you.
- Parameters:
visitedURLs - This parameter is a list of URL's that have already "been visited."
potentiallyNewURLs - This parameter is a list of URL's that are possibly "un-visited" - meaning whatever scrape, crawl or search being performed needs to know which URL's are listed in the previous parameter's contents. This may seem trivial - just use the java url1.equals(url2) command - but, alas, java doesn't exactly take into account upper-case and lower-case domain-names. This worries about case.
- Returns: The number of URL's that were removed from the input Vector parameter 'potentiallyNewURLs'.
- Code:
- Exact Method Body:
// The easiest way to check for duplicates is to build a tree-set of all the URL's as a
// String.  Java's TreeSet<> generic already (automatically) scans for duplicates
// (efficiently) and will tell you if you have tried to add a duplicate
TreeSet<String> dups = new TreeSet<>();

// Build a TreeSet of the url's from the "Visited URLs" parameter
visitedURLs.forEach(url -> dups.add(urlToString(url)));

// Add the "Possibly New URLs", one-by-one, and remove them if they are already in the
// visited list.
int count = 0;
int size  = potentiallyNewURLs.size();
URL url   = null;

for (int i=0; i < size; i++)

    if ((url = potentiallyNewURLs.elementAt(i)) != null)

        if (! dups.add(urlToString(url)))
        { count++; size--; i--; potentiallyNewURLs.removeElementAt(i); }

return count;
-
shortenPoundREF
public static java.net.URL shortenPoundREF(java.net.URL url)
Removes any Fragment-URL '#' symbols from a URL.
If this URL contains a pound-sign Anchor-Name according to the Standard JDK's URL.getRef() method - specifically, if URL.getRef() returns a non-null value - this method rebuilds the URL, without any Anchor-Name / Fragment information.
The intention is to return a URL where any / all String-data that occurs after a '#' Hash-Tag / Pound-Sign is removed.
- Parameters:
url - Any standard HTTP URL. If this 'url' contains a '#' (Pound Sign, Partial Reference) - according to the standard JDK URL.getRef() method - then it shall be removed.
- Returns: The URL without the partial-reference, or the original URL if there was no partial reference. Null is returned if there is an error instantiating the new URL without the partial-reference.
- Code:
- Exact Method Body:
try
{
    if (url.getRef() != null)

        return new URL(
            ((url.getProtocol() != null) ? url.getProtocol().toLowerCase() : "") +
            "://" +
            ((url.getHost() != null) ? url.getHost().toLowerCase() : "") +
            ((url.getFile() != null) ? url.getFile() : "")
        );

    else return url;
}
catch (MalformedURLException e)
    { return null; }
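- Example (Hypothetical Usage Sketch):
A minimal sketch, assuming the 'URLs' class is imported from the Torello.Java.Additional package; the sample URL is one of those shown in the javaURLHelpMessage table above.

import Torello.Java.Additional.URLs;
import java.net.URL;

public class ShortenDemo
{
    public static void main(String[] argv) throws Exception
    {
        URL u = new URL("https://DallasCityHall.com/news/ARTICLE-1.html#subpart1");

        // Per the description above, the '#subpart1' fragment should be gone, and the
        // protocol & host should come back lower-cased
        System.out.println(URLs.shortenPoundREF(u));
    }
}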
-
shortenPoundREFs
public static int shortenPoundREFs(java.util.Vector<java.net.URL> urls, boolean ifExceptionSetNull)
This may seem like a bad thing to do - it removes all "#Name-Anchor" Fragment-Elements from all URL's in a list. Generally, a programmer would probably think such links useful.
When performing a News-or-Content Web-Site Scrape, Page-Fragment Links are more easily handled by removing the Hash-Tag '#' (and everything after it) first. After removing the fragment part of the URL, then and only then should the download begin.
In case Named-Anchors isn't a familiar term, they are URL's that have a Hash-Tag '#' followed by a simple-name placed after the file & directory part of the URL. They usually look like: <A HREF='SomeFile.html#SomeFragmentName'>
Named-Anchor Fragments & Duplicates:
It is important to realize that when scanning for duplicate URL's from a list of URL's, different Web-Page Fragments of the exact same Web-Page are still duplicates. Eliminating the Named-Anchor part of a URL, and then scanning the list of URL's afterwards, makes finding duplicate downloads a lot easier.
An Anchor-Name URL will cause a download of the exact-same content either way. The hash-tag ('#') really only affects how a browser renders the page you look at; it does not affect what content is downloaded from the Web-Server at all.
Exception Suppression:
If, in the process of removing the URL-Fragment, a MalformedURLException is thrown, it will be caught and suppressed, not thrown. What is done afterwards is configurable based on the input-parameters to this method.
This can be useful when working with large numbers of URL's where only a few of them cause problems while resolving them.
- Parameters:
urls - Any list of completed (read: fully-resolved) URL's.
ifExceptionSetNull - If this parameter is passed TRUE, and there is ever an exception-throw while building the new URL's (without the fragment / pound-sign), then that position in the Vector will be replaced with a null.
When this parameter is passed FALSE, if an exception is thrown, then it will be caught and silently ignored.
- Returns: The number / count of URL's in this list that were modified. Whenever a URL Named-Anchor is encountered, it will be removed from the URL, and a new URL without the fragment-part will be inserted to replace the old one.
The integer that is returned here is the number of times that a replacement was made to the input Vector-parameter 'urls'.
- Code:
- Exact Method Body:
int pos          = 0;
int shortenCount = 0;

for (int i = (urls.size() - 1); i >= 0; i--)
{
    URL url = urls.elementAt(i);

    try
    {
        if (url.getRef() != null)
        {
            URL newURL = new URL(
                ((url.getProtocol() != null) ? url.getProtocol().toLowerCase() : "") +
                "://" +
                ((url.getHost() != null) ? url.getHost().toLowerCase() : "") +
                ((url.getFile() != null) ? url.getFile() : "")
            );

            urls.setElementAt(newURL, i);
            shortenCount++;
        }
    }
    catch (MalformedURLException e)
        { if (ifExceptionSetNull) urls.setElementAt(null, i); }
}

return shortenCount;
-
shortenPoundREFs_KE
public static Ret2<java.lang.Integer,java.util.Vector<java.net.MalformedURLException>> shortenPoundREFs_KE (java.util.Vector<java.net.URL> urls, boolean ifExceptionSetNull)
This may seem like a bad thing to do - it removes all "#Name-Anchor" Fragment-Elements from all URL's in a list. Generally, a programmer would probably think such links useful.
When performing a News-or-Content Web-Site Scrape, Page-Fragment Links are more easily handled by removing the Hash-Tag '#' (and everything after it) first. After removing the fragment part of the URL, then and only then should the download begin.
In case Named-Anchors isn't a familiar term, they are URL's that have a Hash-Tag '#' followed by a simple-name placed after the file & directory part of the URL. They usually look like: <A HREF='SomeFile.html#SomeFragmentName'>
Named-Anchor Fragments & Duplicates:
It is important to realize that when scanning for duplicate URL's from a list of URL's, different Web-Page Fragments of the exact same Web-Page are still duplicates. Eliminating the Named-Anchor part of a URL, and then scanning the list of URL's afterwards, makes finding duplicate downloads a lot easier.
An Anchor-Name URL will cause a download of the exact-same content either way. The hash-tag ('#') really only affects how a browser renders the page you look at; it does not affect what content is downloaded from the Web-Server at all.
Exception Suppression:
If, in the process of removing the URL-Fragment, a MalformedURLException is thrown, it will be caught and suppressed, not thrown. What is done afterwards is configurable based on the input-parameters to this method.
This can be useful when working with large numbers of URL's where only a few of them cause problems while resolving them.
KE: Keep Exceptions
This method is identical to the previous method, defined above, except that it allows a programmer to keep / retain any MalformedURLException's that are thrown while re-building the URL's.
- Parameters:
urls - Any list of completed (read: fully-resolved) URL's.
ifExceptionSetNull - If this is TRUE, then if there is ever an exception building a new URL without a "Relative URL '#'" (Pound-Sign), then that position in the Vector will be replaced with 'null.'
- Returns: The number / count of URL's in this list that were modified. If a URL was modified, it was because it had a partial-page reference in it. If, in the process of generating a new URL out of an old one, a MalformedURLException occurs, the exception will be placed in the Ret2.b position, which is a Vector<MalformedURLException>.
SPECIFICALLY:
- Ret2.a = 'Integer' number of URL's shortened for having a '#' partial-reference.
- Ret2.b = Vector<MalformedURLException> where each element of this Vector is null if there were no problems converting the URL, or the exception reference if there were exceptions thrown.
- Code:
- Exact Method Body:
int pos          = 0;
int shortenCount = 0;

Vector<MalformedURLException> v = new Vector<>();

for (int i=0; i < urls.size(); i++) v.setElementAt(null, i);

for (int i = (urls.size() - 1); i >= 0; i--)
{
    URL url = urls.elementAt(i);

    try
    {
        if (url.getRef() != null)
        {
            URL newURL = new URL(
                ((url.getProtocol() != null) ? url.getProtocol().toLowerCase() : "") +
                "://" +
                ((url.getHost() != null) ? url.getHost().toLowerCase() : "") +
                ((url.getFile() != null) ? url.getFile() : "")
            );

            urls.setElementAt(newURL, i);
            shortenCount++;
        }
    }
    catch (MalformedURLException e)
    {
        if (ifExceptionSetNull) urls.setElementAt(null, i);
        v.setElementAt(e, i);
    }
}

return new Ret2<Integer, Vector<MalformedURLException>>(Integer.valueOf(shortenCount), v);
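- Example (Hypothetical Usage Sketch):
A minimal sketch, assuming 'URLs' and 'Ret2' both live in the Torello.Java.Additional package, and that Ret2 exposes its two results through the fields 'a' and 'b' exactly as described in the Returns section above; the sample URL's are illustrative only.

import Torello.Java.Additional.Ret2;
import Torello.Java.Additional.URLs;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Vector;

public class ShortenKEDemo
{
    public static void main(String[] argv) throws Exception
    {
        Vector<URL> urls = new Vector<>();
        urls.add(new URL("https://dallascityhall.com/news/ARTICLE-1.html#subpart1"));
        urls.add(new URL("https://dallascityhall.com/news/"));

        Ret2<Integer, Vector<MalformedURLException>> ret =
            URLs.shortenPoundREFs_KE(urls, true);

        // Ret2.a: how many URL's were shortened; Ret2.b: any exceptions, position-by-position
        System.out.println("shortened: " + ret.a);
        System.out.println("exceptions: " + ret.b);
    }
}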
-
urlToString
public static java.lang.String urlToString(java.net.URL url)
On the internet, a URL is part case-sensitive, and part case-insensitive. The Domain-Name and Protocol ('http://' and 'some.company.com') portions of the URL are Case-Insensitive - they may be in any combination of upper or lower case.
However, the directory, file-name, and (optional) Query-String portions of a URL are (often, but not always) Case-Sensitive. The sensitivity to case in these three parts of a URL is dependent upon the individual Web-Server that is providing the content for the URL.
To summarize: DNS servers, which monitor the Domain-Name part of a URL, treat upper & lower case English-Letters as the same. Web-Servers that utilize the File-Directory part of a URL will sometimes care about case, and sometimes won't. This behavior is dependent upon how the Web-Master has configured his system.
- Parameters:
url - This may be any Internet-Domain URL
- Returns: A String version of this URL, but the domain and protocol portions of the URL will be a "consistent" lower case. The case of the directory, file and (possibly, but not guaranteed to be present) query-string portions will not be modified either way.
NOTE: This type of information is pretty important if you are attempting to scan for duplicate URL's or check their equality.
- Code:
- Exact Method Body:
return
    ((url.getProtocol() != null) ? url.getProtocol().toLowerCase() : "") +
    "://" +
    ((url.getHost() != null) ? url.getHost().toLowerCase() : "") +
    ((url.getPath() != null) ? url.getPath() : "") +
    ((url.getQuery() != null) ? ('?' + url.getQuery()) : "") +
    ((url.getRef() != null) ? ('#' + url.getRef()) : "");
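- Example (Hypothetical Usage Sketch):
A minimal sketch, assuming the 'URLs' class is imported from the Torello.Java.Additional package; the sample URL and the expected output come from the javaURLHelpMessage table above.

import Torello.Java.Additional.URLs;
import java.net.URL;

public class UrlToStringDemo
{
    public static void main(String[] argv) throws Exception
    {
        URL u = new URL("https://DallasCityHall.com/NEWS/article1.html?q=somevalue");

        // Per the table printed by javaURLHelpMessage(StorageWriter), this should print:
        // https://dallascityhall.com/NEWS/article1.html?q=somevalue
        System.out.println(URLs.urlToString(u));
    }
}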
-
CURL
public static void CURL(java.net.URL url, java.lang.String outFileName, java.lang.String userAgent) throws java.io.IOException
As of today, the version of the UNIX 'curl' command does not seem to be downloading everything properly. It downloaded an image '.png' file just fine, but seemed to have botched a zip-file. This does what the UNIX 'curl' command does, but does not actually invoke the UNIX operating system to do it. It just does this...
- Parameters:
url - This may be any URL, but it is intended to be a downloadable file. It will download '.html' files fine, but you may try images, data-files, zip-files, tar-archives, and movies.
outFileName - You must specify a file-name, and if this parameter is null, a NullPointerException will be thrown immediately. If you would like your program to guess the filename - based on the file named in the URL - please use the method URL.getFile(), or something to that effect.
userAgent - A User-Agent, as a String. If this parameter is passed null, it will be silently ignored, and a User-Agent won't be used.
- Throws:
java.io.IOException - If there are I/O Errors when using the HttpURLConnection.
- Code:
- Exact Method Body:
HttpURLConnection con = (HttpURLConnection) url.openConnection();

con.setRequestMethod("GET");

if (userAgent != null) con.setRequestProperty("User-Agent", userAgent);

InputStream      is  = con.getInputStream();
FileOutputStream fos = new FileOutputStream(outFileName);

byte[] b      = new byte[5000];
int    result = 0;

while ((result = is.read(b)) != -1) fos.write(b, 0, result);

fos.flush();
fos.close();
is.close();
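- Example (Hypothetical Usage Sketch):
A minimal sketch, assuming the 'URLs' class is imported from the Torello.Java.Additional package; the URL, the output file-name, and the User-Agent String are all illustrative only.

import Torello.Java.Additional.URLs;
import java.net.URL;

public class CurlDemo
{
    public static void main(String[] argv) throws Exception
    {
        // Both the URL and the output file-name here are illustrative
        URL url = new URL("https://dallascityhall.com/news/ARTICLE-1.html");

        // A null user-agent is also permitted; the header is simply omitted in that case
        URLs.CURL(url, "ARTICLE-1.html", "Mozilla/5.0 (compatible; Java-HTML)");
    }
}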
-
-