Package Torello.Java.Additional
Class URLs
- java.lang.Object
-
- Torello.Java.Additional.URLs
-
public class URLs extends java.lang.Object
A class that plays with URL's, no more, no less.
This provides a few utility functions for dealing with URL's.
This class does not perform relative / absolute URL resolution. URL resolution (completing a partial-URL using the complete-URL of the page on which the link sits) can be accomplished using the class Links, which may be found in the 'Torello.HTML' package.
This class helps analyze, just a tad, the escaping of certain characters found inside a Uniform Resource Locator so that it may connect to a Web-Server, or any HTTP / AJAX-Server.
Modern Existentialism:
This is an "existential" or "experimental" collection of attempts; it is not intended to be a serious thing.
Hi-Lited Source-Code:
- View Here: Torello/Java/Additional/URLs.java
- Open New Browser-Tab: Torello/Java/Additional/URLs.java
File Size: 34,266 Bytes, Line Count: 843 '\n' Characters Found
Stateless Class:
This class neither contains any program-state, nor can it be instantiated. The @StaticFunctional Annotation may also be called 'The Spaghetti Report'. Static-Functional classes are, essentially, C-Styled Files, without any constructors or non-static member fields. It is a concept very similar to the Java-Bean's @Stateless Annotation.
- 1 Constructor(s), 1 declared private, zero-argument constructor
- 16 Method(s), 16 declared static
- 6 Field(s), 6 declared static, 6 declared final
-
-
Field Summary

Fields
Modifier and Type             Field
protected static Pattern      P1
protected static String       RE1
protected static char[]       URL_ESC_CHARS
protected static char[]       URL_ESC_CHARS_ABBREV
protected static char[]       VOWELS
protected static String[]     VOWELS_URL
-
Method Summary

Fun with URL Escape
Modifier and Type                                      Method
static String                                          toProperURLV1(String url)
static String                                          toProperURLV2(String urlStuff)
static String                                          toProperURLV3(String url)
static String                                          toProperURLV4(String url)
static String                                          toProperURLV5(String url)
static String                                          toProperURLV6(String url)
static String                                          toProperURLV7(String url)
static String                                          toProperURLV8(URL url)

Remove Anchor-Name / Relative / Fragment URLs
Modifier and Type                                      Method
static URL                                             shortenPoundREF(URL url)
static int                                             shortenPoundREFs(Vector<URL> urls, boolean ifExceptionSetNull)
static Ret2<Integer,Vector<MalformedURLException>>     shortenPoundREFs_KE(Vector<URL> urls, boolean ifExceptionSetNull)

Remove Duplicate URL's from a List
Modifier and Type                                      Method
static int                                             removeDuplicates(Vector<URL> urls)
static int                                             removeDuplicates(Vector<URL> visitedURLs, Vector<URL> potentiallyNewURLs)

Other Methods
Modifier and Type                                      Method
static void                                            CURL(URL url, String outFileName, String userAgent)
protected static void                                  javaURLHelpMessage(StorageWriter sw)
static String                                          urlToString(URL url)
-
-
-
Field Detail
-
RE1
protected static final java.lang.String RE1
This is a Regular-Expression Pattern (java.util.regex.Pattern) - saved as a String. It is subsequently compiled.
The primary function is to match String's that are intended to be HTTP-URL's. This Regular Expression matches:
http(s)://...<any-text>.../
http(s)://...<any-text, not forward-slash>...
http(s)://...<any-text>.../...<any-text, not forward-slash>...
- See Also: P1, Constant Field Values
- Code:
- Exact Field Declaration Expression:
protected static final String RE1 = "^(http[s]?:\\/\\/.*?\\/$|http[s]?:\\/\\/[^\\/]*$|http[s]?:\\/\\/.*?\\/[^\\/]+)";
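- Example (Hypothetical Regex Sketch):
A minimal, self-contained sketch showing how a Pattern compiled from RE1 behaves against a few sample URL-Strings. The sample URL's are illustrative only, and the printed prefix is simply whatever the first matching alternative captures.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RE1Demo
{
    // Copied from the field declaration above
    static final String RE1 =
        "^(http[s]?:\\/\\/.*?\\/$|http[s]?:\\/\\/[^\\/]*$|http[s]?:\\/\\/.*?\\/[^\\/]+)";

    public static void main(String[] argv)
    {
        Pattern p = Pattern.compile(RE1);

        String[] tests =
        {
            "https://dallascityhall.com/",              // host followed by a trailing slash
            "http://dallascityhall.com",                // host only, no slash at all
            "https://dallascityhall.com/news/a.html"    // host, directory and file
        };

        for (String s : tests)
        {
            Matcher m = p.matcher(s);
            System.out.println(s + "  =>  " + (m.find() ? m.group(1) : "(no match)"));
        }
    }
}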
-
P1
-
VOWELS
protected static final char[] VOWELS
When scraping Spanish URL's, these characters can / should be escaped.
Parallel Array Note:
This array shall be considered parallel to the replacement String[]-Array VOWELS_URL.
- See Also: toProperURLV1(String), VOWELS_URL
- Code:
- Exact Field Declaration Expression:
protected static final char[] VOWELS = { 'á', 'É', 'é', 'Í', 'í', 'Ó', 'ó', 'Ú', 'ú', 'Ü', 'ü', 'Ñ', 'ñ', 'Ý', 'ý', '¿', '¡' };
-
VOWELS_URL
protected static final java.lang.String[] VOWELS_URL
When scraping Spanish URL's, these String's are the URL Escape Sequences for the Spanish Vowel Characters listed in VOWELS.
Parallel Array Note:
This array shall be considered parallel to the char[]-Array VOWELS.
- See Also: toProperURLV1(String), VOWELS
- Code:
- Exact Field Declaration Expression:
protected static final String[] VOWELS_URL = { "%C3%A1", "%C3%89", "%C3%A9", "%C3%8D", "%C3%AD", "%C3%93", "%C3%B3", "%C3%9A", "%C3%BA", "%C3%9C", "%C3%BC", "%C3%91", "%C3%B1", "%C3%9D", "%C3%BD", "%C2%BF", "%C2%A1" };
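- Example (Hypothetical Parallel-Array Sketch):
A minimal sketch of the parallel-array substitution that VOWELS and VOWELS_URL imply. It does by hand what the documented call StrReplace.r(url, VOWELS, VOWELS_URL) is described as doing; the two arrays are copied from the field declarations above, and the sample URL is illustrative.

public class ParallelReplaceSketch
{
    static final char[]   VOWELS     = { 'á', 'É', 'é', 'Í', 'í', 'Ó', 'ó', 'Ú', 'ú',
                                          'Ü', 'ü', 'Ñ', 'ñ', 'Ý', 'ý', '¿', '¡' };

    static final String[] VOWELS_URL = { "%C3%A1", "%C3%89", "%C3%A9", "%C3%8D", "%C3%AD",
                                          "%C3%93", "%C3%B3", "%C3%9A", "%C3%BA", "%C3%9C",
                                          "%C3%BC", "%C3%91", "%C3%B1", "%C3%9D", "%C3%BD",
                                          "%C2%BF", "%C2%A1" };

    static String escapeSpanish(String url)
    {
        StringBuilder sb = new StringBuilder();

        outer:
        for (char c : url.toCharArray())
        {
            // If 'c' sits at position i in VOWELS, append the escape at position i instead
            for (int i = 0; i < VOWELS.length; i++)
                if (VOWELS[i] == c) { sb.append(VOWELS_URL[i]); continue outer; }

            sb.append(c);
        }

        return sb.toString();
    }

    public static void main(String[] argv)
    {
        // The 'ñ' should be replaced by its UTF-8 escape, "%C3%B1"
        System.out.println(escapeSpanish("https://es.wikipedia.org/wiki/España"));
    }
}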
-
URL_ESC_CHARS
protected static final char[] URL_ESC_CHARS
This list of java char's are characters that are better off escaped when passing them through a URL.
- See Also: toProperURLV2(String)
- Code:
- Exact Field Declaration Expression:
protected static final char[] URL_ESC_CHARS = { '%', ' ', '#', '$', '&', '@', '`', '/', ':', ';', '<', '=', '>', '?', '[', '\\', ']', '^', '{', '|', '}', '~', '\'', '+', ',' };
-
URL_ESC_CHARS_ABBREV
protected static final char[] URL_ESC_CHARS_ABBREV
This is a (shortened) list of characters that should be escaped before being used within a URL.
This version differs from URL_ESC_CHARS in that it does not include the '&' (ampersand), the '?' (question-mark) or the '/' (forward-slash).
- See Also: URL_ESC_CHARS, toProperURLV4(String)
- Code:
- Exact Field Declaration Expression:
protected static final char[] URL_ESC_CHARS_ABBREV = { '%', ' ', '#', '$', '@', '`', ':', ';', '<', '=', '>', '[', '\\', ']', '^', '{', '|', '}', '~', '\'', '+', ',' };
-
-
Method Detail
-
javaURLHelpMessage
protected static final void javaURLHelpMessage(StorageWriter sw)
Java Help Message explaining class java.net.URL - and the specific output of its methods. This will just print a 'friendly-reminder' to the terminal-console output showing what the actual output of the class java.net.URL actually is. This helps when breaking-up / resolving partial URL links and partial Image-URL links.
Generally, playing with URL's gets more interesting when any foreign language characters are involved. There is an EXTREMELY standardized way to encode characters in just about any language in the world (and the name of that "way" is UTF-8).
Dallas Government Filth:
The following output was generated when scraping the City of Dallas Web-Server, collecting E-Mail addresses for the E-Mail Distribution list regarding Human-Rights Abuses (Hypno-Programming) in this city. Programmers are not obligated to write their City-Council Man or their Congressman in order to use any of the material in this scrape package. However, if you are concerned about the abuses of power in the "former" United States, scraping government-websites to collect the e-mail addresses of the rapist-trash in power in Dallas is very easy using this package.

java.net.URL Method()    String-Result

u.toString()             https://DALLASCITYHALL.com
u.getProtocol()          https
u.getHost()              DALLASCITYHALL.com
u.getPath()
u.getFile()
u.getQuery()             null
u.getRef()               null
u.getAuthority()         DALLASCITYHALL.com
u.getUserInfo()          null
urlToString(u)           https://dallascityhall.com

u.toString()             https://dallascityhall.com/
u.getProtocol()          https
u.getHost()              dallascityhall.com
u.getPath()              /
u.getFile()              /
u.getQuery()             null
u.getRef()               null
u.getAuthority()         dallascityhall.com
u.getUserInfo()          null
urlToString(u)           https://dallascityhall.com/

u.toString()             https://dallascityhall.com/news
u.getProtocol()          https
u.getHost()              dallascityhall.com
u.getPath()              /news
u.getFile()              /news
u.getQuery()             null
u.getRef()               null
u.getAuthority()         dallascityhall.com
u.getUserInfo()          null
urlToString(u)           https://dallascityhall.com/news

u.toString()             https://dallascityhall.com/news/
u.getProtocol()          https
u.getHost()              dallascityhall.com
u.getPath()              /news/
u.getFile()              /news/
u.getQuery()             null
u.getRef()               null
u.getAuthority()         dallascityhall.com
u.getUserInfo()          null
urlToString(u)           https://dallascityhall.com/news/

u.toString()             http://DALLASCITYHALL.com/news/ARTICLE-1.html
u.getProtocol()          http
u.getHost()              DALLASCITYHALL.com
u.getPath()              /news/ARTICLE-1.html
u.getFile()              /news/ARTICLE-1.html
u.getQuery()             null
u.getRef()               null
u.getAuthority()         DALLASCITYHALL.com
u.getUserInfo()          null
urlToString(u)           http://dallascityhall.com/news/ARTICLE-1.html

u.toString()             https://DallasCityHall.com/NEWS/article1.html?q=somevalue
u.getProtocol()          https
u.getHost()              DallasCityHall.com
u.getPath()              /NEWS/article1.html
u.getFile()              /NEWS/article1.html?q=somevalue
u.getQuery()             q=somevalue
u.getRef()               null
u.getAuthority()         DallasCityHall.com
u.getUserInfo()          null
urlToString(u)           https://dallascityhall.com/NEWS/article1.html?q=somevalue

u.toString()             https://DallasCityHall.com/news/ARTICLE-1.html#subpart1
u.getProtocol()          https
u.getHost()              DallasCityHall.com
u.getPath()              /news/ARTICLE-1.html
u.getFile()              /news/ARTICLE-1.html
u.getQuery()             null
u.getRef()               subpart1
u.getAuthority()         DallasCityHall.com
u.getUserInfo()          null
urlToString(u)           https://dallascityhall.com/news/ARTICLE-1.html#subpart1

u.toString()             https://DallasCityHall.com/NEWS/article1.html?q=somevalue&q2=someOtherValue
u.getProtocol()          https
u.getHost()              DallasCityHall.com
u.getPath()              /NEWS/article1.html
u.getFile()              /NEWS/article1.html?q=somevalue&q2=someOtherValue
u.getQuery()             q=somevalue&q2=someOtherValue
u.getRef()               null
u.getAuthority()         DallasCityHall.com
u.getUserInfo()          null
urlToString(u)           https://dallascityhall.com/NEWS/article1.html?q=somevalue&q2=someOtherValue

u.toString()             https://DallasCityHall.com/NEWS/article1.html?q=somevalue&q2=someOtherValue#LocalRef
u.getProtocol()          https
u.getHost()              DallasCityHall.com
u.getPath()              /NEWS/article1.html
u.getFile()              /NEWS/article1.html?q=somevalue&q2=someOtherValue
u.getQuery()             q=somevalue&q2=someOtherValue
u.getRef()               LocalRef
u.getAuthority()         DallasCityHall.com
u.getUserInfo()          null
urlToString(u)           https://dallascityhall.com/NEWS/article1.html?q=somevalue&q2=someOtherValue#LocalRef

- Parameters:
sw
- An instance of class StorageWriter. This parameter may be null, and if it is, text-output will be sent to Standard-Output.
- Code:
- Exact Method Body:
if (sw == null) sw = new StorageWriter();

String[] urlStrArr =
{
    "https://DALLASCITYHALL.com",
    "https://dallascityhall.com/",
    "https://dallascityhall.com/news",
    "https://dallascityhall.com/news/",
    "http://DALLASCITYHALL.com/news/ARTICLE-1.html",
    "https://DallasCityHall.com/NEWS/article1.html?q=somevalue",
    "https://DallasCityHall.com/news/ARTICLE-1.html#subpart1",
    "https://DallasCityHall.com/NEWS/article1.html?q=somevalue&q2=someOtherValue",
    "https://DallasCityHall.com/NEWS/article1.html?q=somevalue&q2=someOtherValue#LocalRef"
};

URL[] urlArr = new URL[urlStrArr.length];

try
{
    for (int i=0; i < urlStrArr.length; i++) urlArr[i] = new URL(urlStrArr[i]);
}
catch (Exception e)
{
    sw.println(
        "Broke a URL, and it generated an exception.\n" +
        "Sorry, fix the URL's in this method.\n" +
        "Did you change them?"
    );
    e.printStackTrace();
    return;
}

for (URL u : urlArr)
{
    System.out.println(
        "u.toString():\t\t"   + BCYAN + u.toString() + RESET + '\n' +
        "u.getProtocol():\t"  + u.getProtocol()  + '\n' +
        "u.getHost():\t\t"    + u.getHost()      + '\n' +
        "u.getPath():\t\t"    + u.getPath()      + '\n' +
        "u.getFile():\t\t"    + u.getFile()      + '\n' +
        "u.getQuery():\t\t"   + u.getQuery()     + '\n' +
        "u.getRef():\t\t"     + u.getRef()       + '\n' +
        "u.getAuthority():\t" + u.getAuthority() + '\n' +
        "u.getUserInfo():\t"  + u.getUserInfo()  + '\n' +
        "urlToString(u):\t\t" + urlToString(u)
    );
}
-
toProperURLV1
public static java.lang.String toProperURLV1(java.lang.String url)
This will substitute many of the Spanish-characters that can make a web-query difficult. These are the substitutions listed:

Spanish Language Character    URL Escape Sequence
Á                             %C3%81
á                             %C3%A1
É                             %C3%89
é                             %C3%A9
Í                             %C3%8D
í                             %C3%AD
Ó                             %C3%93
ó                             %C3%B3
Ú                             %C3%9A
ú                             %C3%BA
Ü                             %C3%9C
ü                             %C3%BC
Ñ                             %C3%91
ñ                             %C3%B1
Ý                             %C3%9D
ý                             %C3%BD
Historical Note:
This method was written the very first time that a URL needed to be escaped during the writing of the Java-HTML '.jar'.
- Parameters:
url - Any website URL query.
- Returns: The same URL with substitutions made.
- See Also: VOWELS, VOWELS_URL, StrReplace.r(String, char[], String[])
- Code:
- Exact Method Body:
return StrReplace.r(url, VOWELS, VOWELS_URL);
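- Example (Hypothetical Usage Sketch):
A minimal sketch, assuming the 'URLs' class is imported from the Torello.Java.Additional package. The sample URL, and the comment about its expected escape, are illustrative and based on the substitution table above.

import Torello.Java.Additional.URLs;

public class V1Demo
{
    public static void main(String[] argv)
    {
        // 'España' contains 'ñ', which appears in the substitution table above
        String escaped = URLs.toProperURLV1("https://es.wikipedia.org/wiki/España");

        // Based on the table, the 'ñ' should come back as "%C3%B1"
        System.out.println(escaped);
    }
}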
-
toProperURLV2
public static java.lang.String toProperURLV2(java.lang.String urlStuff)
This method will clobber the leading Domain-Name and Protocol - the http://domain.name.something/ stuff. It is best to use this method on String's that will be inserted into a URL after the '?' question-mark, inside the Query-String.
This can be very useful when sending JSON Arguments, for instance, inside a URL's Query-String, instead of the GET / POST part of a request.
Note that this method should not be used to escape characters outside of the range of Standard-ASCII (characters 0 ... 255).
State of the Experiment:
It seems to help to escape these characters: # $ % & @ ` / : ; < = > ? [ \ ] ^ | ~ " ' + , { }
- Parameters:
urlStuff - Any information that is intended to be sent via an HTTP-URL, and needs to be escaped.
- Returns: An escaped version of this URL-String
- See Also: URL_ESC_CHARS, StrReplace.r(String, char[], IntCharFunction)
- Code:
- Exact Method Body:
return StrReplace.r( urlStuff, URL_ESC_CHARS, (int i, char c) -> '%' + Integer.toHexString((int) c) );
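- Example (Hypothetical Usage Sketch):
A minimal sketch, assuming the 'URLs' class is imported from the Torello.Java.Additional package; the JSON fragment used as input is illustrative only.

import Torello.Java.Additional.URLs;

public class V2Demo
{
    public static void main(String[] argv)
    {
        // A JSON fragment that is about to be placed after the '?' of a Query-String
        String json = "{\"name\": \"value\"}";

        // Every character listed in URL_ESC_CHARS should come back as '%' plus its
        // hex code-point; for instance '{' should become "%7b" and ' ' should become "%20"
        System.out.println(URLs.toProperURLV2(json));
    }
}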
-
toProperURLV3
public static java.lang.String toProperURLV3(java.lang.String url)
This leaves out the actual domain name before starting HTTP-URL Escape Sequences. If this starts with the words "http://domain.something/" then the initial colon, forward-slashes and periods won't be escaped. Everything after the first forward-slash will include URL-HTTP Escape characters.
This does the same thing as toProperURLV2(String), but skips the initial part of the URL text/string - IF PRESENT! http(s?)://domain.something/ is skipped by the Regular Expression; everything else from URLV2 is escaped.
- Parameters:
url - This may be any internet URL, represented as a String. It will be escaped with the %INT format.
- Returns: An escaped URL String
- See Also: toProperURLV2(String), P1
- Code:
- Exact Method Body:
String beginsWith = null;
Matcher m = P1.matcher(url);

if (m.find())
{
    beginsWith = m.group(1);
    url        = url.substring(beginsWith.length());
}

return ((beginsWith != null) ? beginsWith : "") + toProperURLV2(url);
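- Example (Hypothetical Usage Sketch):
A minimal sketch, assuming the 'URLs' class is imported from the Torello.Java.Additional package; the sample URL is illustrative only.

import Torello.Java.Additional.URLs;

public class V3Demo
{
    public static void main(String[] argv)
    {
        // The leading portion matched by the P1 Regular-Expression is kept verbatim;
        // the remainder is escaped exactly as toProperURLV2(String) would escape it,
        // so the space in the file-name should come back as "%20"
        String url = "https://dallascityhall.com/news/some article.html";

        System.out.println(URLs.toProperURLV3(url));
    }
}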
-
toProperURLV4
public static java.lang.String toProperURLV4(java.lang.String url)
This does the same thing as V3, but it also will avoid escaping any '?' (question-mark), '&' (ampersand) or '/' (forward-slash) symbols anywhere in the entire String. It also "skips" escaping the initial HTTP(s)://domain.net.something/ as well - just like toProperURLV3.
- Returns: This does the same thing as toProperURLV3(String), but leaves out 100% of the instances of Ampersand, Question-Mark, and Forward-Slash symbols.
- See Also: toProperURLV3(String), P1, URL_ESC_CHARS_ABBREV, StrReplace.r(String, char[], IntCharFunction)
- Code:
- Exact Method Body:
String beginsWith = null;
Matcher m = P1.matcher(url);

if (m.find())
{
    beginsWith = m.group(1);
    url        = url.substring(beginsWith.length());
}

return ((beginsWith != null) ? beginsWith : "") + StrReplace.r(
    url, URL_ESC_CHARS_ABBREV,
    (int i, char c) -> '%' + Integer.toHexString((int) c)
);
-
toProperURLV5
public static java.lang.String toProperURLV5(java.lang.String url)
This is probably the "smartest" URL Encoder. The Java URL-Encoder doesn't do any good! It literally encodes the forward-slashes inside the "HTTP://" string! That is a major mistake. Understanding how URL encoding works basically requires downloading many Web-Pages.
Simple ASCII:
DNS does not really allow non-ASCII characters to be included inside of a Domain-Name. Doing any character-escaping inside of the host-part of a URL is not necessary, and if a programmer is trying to escape characters inside the "host" of a URL, he must not have tested the URL, because it is not likely to be valid.
Perhaps this differs in other parts of the world, where American-DNS may not be in use.
UTF-8 Foreign Language Characters:
Escaping characters in the directory or file part of a URL is generally a good idea, but there are many web-servers that are capable of dealing with Foreign-Language and UTF-8 characters by themselves just fine. In fact, for most of the URL's sent to the Chinese Government Web-Portal, no URL-Encoding (Character-Escaping) was necessary at all. All of them contained Chinese Language Characters.
There are some Web-Servers that do not like non-ASCII Characters inside the File / Path that comes after the domain. The "Wiki-Art" project Web-Server, for instance, expects that any accented European French or Spanish Vowels all be URL-Encoded ("Escaped") using the UTF-8 Escape-Sequences.
URL Query Strings:
Most importantly, the way any Web-Server handles the Query-Strings might also be different than the way it handles the file and path String's. Generally, there is no guaranteed, consistent & successful way to deal with URL-encoding, since there are many different types of Web-Servers on the Internet.
Moreover, how things are handled overseas in the more developed countries of Asia makes knowing what is going on even more difficult.
After Domain-Name:
This URL-Encoding method only encodes the file and directory portion. The Domain-Name and 'HTTP' part are left alone. If there is a Query-String included in this URL, it will be left unchanged.
If there is a 'ref' portion of this URL, it will also be left unchanged.
Again, only the file & directory name of the URL shall be encoded with the '%' (percent) URL-encoding scheme.
Little Summary:
Mostly, the earlier versions of these URL-encoding experiments are being left in this package, even though they might qualify as 'useless.' I don't actually visit a lot of new Web-Sites, and therefore new URL's are hard to test and think about.
- Parameters:
url - This is the URL to be encoded, properly
- Returns: A properly encoded URL String. Important: if calling the java.net.URL constructor generates a MalformedURLException, then this method shall return null. The java.net.URL constructor will be called if the String passed begins with the characters 'http://' or 'https://'.
- Code:
- Exact Method Body:
url = url.trim();

URL      u    = null;
String[] sArr = null;
String   tlc  = url.toLowerCase();

if (tlc.startsWith("http://") || tlc.startsWith("https://"))
{
    try
        { u = new URL(url); }
    catch (Exception e)
        { return null; }
}

if (u == null)  sArr = url.split("/");
else            sArr = u.getPath().split("/");

String        slash = "";
StringBuilder sb    = new StringBuilder();

for (String s : sArr)
{
    try
        { sb.append(slash + java.net.URLEncoder.encode(s, "UTF-8")); }
    catch (UnsupportedEncodingException e)
        { /* This really cannot happen, and I don't know what to put here! */ }

    slash = "/";
}

if (u == null) return sb.toString();

else return
    u.getProtocol() + "://" + u.getHost() + sb.toString() +
    ((u.getQuery() != null) ? ("?" + u.getQuery()) : "") +
    ((u.getRef()   != null) ? ("#" + u.getRef())   : "");
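- Example (Hypothetical Usage Sketch):
A minimal sketch, assuming the 'URLs' class is imported from the Torello.Java.Additional package; the Wiki-Art style URL is illustrative only.

import Torello.Java.Additional.URLs;

public class V5Demo
{
    public static void main(String[] argv)
    {
        // Only the directory & file portion is encoded; the protocol, host,
        // query-string and '#' ref should be passed through unchanged, while the
        // accented 'e' in the path should become its UTF-8 escape, "%C3%A9"
        String url = "https://www.wikiart.org/en/paul-cézanne?tab=featured#top";

        System.out.println(URLs.toProperURLV5(url));
    }
}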
-
toProperURLV6
public static java.lang.String toProperURLV6(java.lang.String url)
Rather than trying to explain what is escaped and what is left alone, please review the exact code here.
Another One:
Well, I just wrote another one, they told me to. This newest version of URL-Encoding is actually pretty successful. It handles all Extra-Characters and is capable of dealing with URL's that contain the '?' '=' '&' operators of GET-Requests.
Realize that in the out-of-the-box JDK, there is a class called "URI Encoder" - but that class expects the URL to have already been separated out into its distinct parts.
This method does the URL-separating into disparate parts before performing the Character-Escaping.
- Parameters:
url - This is any java URL.
- Returns: a new String version of the input parameter 'url'
- Code:
- Exact Method Body:
URL u = null;

try
    { u = new URL(url); }
catch (Exception e)
    { return null; }

StringBuilder sb = new StringBuilder();

sb.append(u.getProtocol());
sb.append("://");
sb.append(u.getHost());
sb.append(toProperURLV5(u.getPath()));

if (u.getQuery() != null)
{
    String[]      sArr      = u.getQuery().split("&");
    StringBuilder sb2       = new StringBuilder();
    String        ampersand = "";

    for (String s : sArr)
    {
        String[]      s2Arr  = s.split("=");
        StringBuilder sb3    = new StringBuilder();
        String        equals = "";

        for (String s2: s2Arr)
        {
            try
                { sb3.append(equals + java.net.URLEncoder.encode(s2, "UTF-8")); }

            // This should never happen - UTF-8 is (sort-of) the only encoding.
            catch (UnsupportedEncodingException e) { }

            equals = "=";
        }

        sb2.append(ampersand + sb3.toString());
        ampersand = "&";
    }

    sb.append("?" + sb2.toString());
}

// Not really a clue, because the "#" operator and the "?" probably shouldn't be used
// together.  Java's java.net.URL class will parse a URL that has both the ? and the #, but
// I have no idea which Web-Sites would allow this, or encourage this...
if (u.getRef() != null)
    try
        { sb.append("#" + java.net.URLEncoder.encode(u.getRef(), "UTF-8")); }
    catch (UnsupportedEncodingException e) { }

return sb.toString();
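- Example (Hypothetical Usage Sketch):
A minimal sketch, assuming the 'URLs' class is imported from the Torello.Java.Additional package; the sample URL and its query values are illustrative only.

import Torello.Java.Additional.URLs;

public class V6Demo
{
    public static void main(String[] argv)
    {
        // The path and each name / value inside the Query-String should be escaped
        // individually, while the '?', '=' and '&' separators are preserved
        String url = "https://dallascityhall.com/búsqueda.html?q=josé garcía&lang=es";

        System.out.println(URLs.toProperURLV6(url));
    }
}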
-
toProperURLV7
public static java.lang.String toProperURLV7(java.lang.String url) throws java.net.URISyntaxException, java.net.MalformedURLException
These strictly use Java's URI Encoding Mechanism. They seem to work the same as "V6". Internally, these are now used - this as of November, 2019.
- Parameters:
url - A Complete Java URL, as a String. Any specialized Escape-Characters that need to be escaped, will be.
- Throws:
java.net.URISyntaxException - This will throw if building the URI generates an exception. Internally, all this method does is build a URI, and then call the Java Method 'toASCIIString()'
java.net.MalformedURLException
- Code:
- Exact Method Body:
return toProperURLV8(new URL(url));
-
toProperURLV8
public static java.lang.String toProperURLV8(java.net.URL url) throws java.net.URISyntaxException, java.net.MalformedURLException
These strictly use Java's URI Encoding Mechanism. They seem to work the same as "V6". Internally, these are now used - this as of November, 2019.
- Parameters:
url - A Complete Java URL. Any specialized Escape-Characters that need to be escaped, will be.
- Throws:
java.net.URISyntaxException - This will throw if building the URI generates an exception. Internally, all this method does is build a URI, and then call the Java Method 'toASCIIString()'
java.net.MalformedURLException
- Code:
- Exact Method Body:
return new URI(
    url.getProtocol(), url.getUserInfo(), url.getHost(), url.getPort(),
    url.getPath(), url.getQuery(), url.getRef()
).toASCIIString();
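- Example (JDK Mechanism Sketch):
A minimal sketch of the standard-JDK mechanism that the method body above relies on (the multi-argument java.net.URI constructor plus toASCIIString()); the sample URL is illustrative only.

import java.net.URI;
import java.net.URL;

public class V8Demo
{
    public static void main(String[] argv) throws Exception
    {
        URL url = new URL("https://dallascityhall.com/página.html?q=año nuevo#sección");

        // Same multi-argument java.net.URI constructor used in the method body above.
        // The constructor quotes illegal characters (such as the space), and
        // toASCIIString() percent-encodes any remaining non-ASCII characters.
        String ascii = new URI(
            url.getProtocol(), url.getUserInfo(), url.getHost(), url.getPort(),
            url.getPath(), url.getQuery(), url.getRef()
        ).toASCIIString();

        System.out.println(ascii);
    }
}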
-
removeDuplicates
public static int removeDuplicates(java.util.Vector<java.net.URL> urls)
If you have a list of URL's, and want to quickly remove any duplicate-URL's found in the list - this will remove them.
Case Sensitivity:
This method will perform a few "to-lower-case" operations on the protocol and Web-Domain parts, but not on the file, directory, or Query-String portion of the URL.
This should hilite what is Case-Sensitive, and what is not:
- These are considered duplicate URL's:
http://some.company.com/index.html
HTTP://SOME.COMPANY.COM/index.html
- These are not considered duplicate URL's:
http://other.company.com/Directory/Ben-Bitdiddle.html
http://other.company.com/DIRECTORY/BE.html
- Parameters:
urls - Any list of URL's, some of which might have been duplicated. The difference between this 'removeDuplicates' and the other 'removeDuplicates' available in this class is that this one only removes multiple instances of the same URL in this Vector, while the other one iterates through a list of URL's already visited in a previous-session.
NOTE: Null Vector-values are skipped outright; they are neither removed nor changed.
- Returns: The number of Vector elements that were removed. (i.e. The size by which the Vector was shrunk.)
- Code:
- Exact Method Body:
TreeSet<String> dups = new TreeSet<>();

int count = 0;
int size  = urls.size();
URL url   = null;

for (int i=0; i < size; i++)

    if ((url = urls.elementAt(i)) != null)

        if (! dups.add(urlToString(url)))
        { count++; size--; i--; urls.removeElementAt(i); }

return count;
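- Example (Hypothetical Usage Sketch):
A minimal sketch, assuming the 'URLs' class is imported from the Torello.Java.Additional package; the two sample URL's are taken from the case-sensitivity discussion above.

import Torello.Java.Additional.URLs;
import java.net.URL;
import java.util.Vector;

public class RemoveDupsDemo
{
    public static void main(String[] argv) throws Exception
    {
        Vector<URL> urls = new Vector<>();

        // These two differ only in protocol / domain case, so - per the description
        // above - they should be treated as duplicates of one another
        urls.add(new URL("http://some.company.com/index.html"));
        urls.add(new URL("HTTP://SOME.COMPANY.COM/index.html"));

        int removed = URLs.removeDuplicates(urls);

        System.out.println("removed=" + removed + ", remaining=" + urls.size());
    }
}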
-
removeDuplicates
public static int removeDuplicates (java.util.Vector<java.net.URL> visitedURLs, java.util.Vector<java.net.URL> potentiallyNewURLs)
This simple method will remove any URL's from the input Vector parameter 'potentiallyNewURLs' which are also present-members of the input Vector parameter 'visitedURLs'.
This may seem trivial, and it is, but it worries about things like the String's Case for you.
- Parameters:
visitedURLs - This parameter is a list of URL's that have already "been visited."
potentiallyNewURLs - This parameter is a list of URL's that are possibly "un-visited" - meaning whatever scrape, crawl or search being performed needs to know which URL's are listed in the previous parameter's contents. This may seem trivial - just use the java url1.equals(url2) command - but, alas, java doesn't exactly take into account upper-case and lower-case domain-names. This worries about case.
- Returns: The number of URL's that were removed from the input Vector parameter 'potentiallyNewURLs'.
- Code:
- Exact Method Body:
// The easiest way to check for duplicates is to build a tree-set of all the URL's as a
// String.  Java's TreeSet<> generic already (automatically) scans for duplicates
// (efficiently) and will tell you if you have tried to add a duplicate
TreeSet<String> dups = new TreeSet<>();

// Build a TreeSet of the url's from the "Visited URLs" parameter
visitedURLs.forEach(url -> dups.add(urlToString(url)));

// Add the "Possibly New URLs", one-by-one, and remove them if they are already in the
// visited list.
int count = 0;
int size  = potentiallyNewURLs.size();
URL url   = null;

for (int i=0; i < size; i++)

    if ((url = potentiallyNewURLs.elementAt(i)) != null)

        if (! dups.add(urlToString(url)))
        { count++; size--; i--; potentiallyNewURLs.removeElementAt(i); }

return count;
-
shortenPoundREF
public static java.net.URL shortenPoundREF(java.net.URL url)
Removes any Fragment-URL '#' symbols from a URL.
If this URL contains a pound-sign Anchor-Name according to the Standard JDK's URL.getRef() method - specifically, if URL.getRef() returns a non-null value - this method rebuilds the URL, without any Anchor-Name / Fragment information.
The intention is to return a URL where any / all String-data that occurs after a '#' Hash-Tag / Pound-Sign is removed.
- Parameters:
url - Any standard HTTP URL. If this 'url' contains a '#' (Pound Sign, Partial Reference) - according to the standard JDK URL.getRef() method - then it shall be removed.
- Returns: The URL without the partial-reference, or the original URL if there was no partial reference. Null is returned if there is an error instantiating the new URL without the partial-reference.
- Code:
- Exact Method Body:
try
{
    if (url.getRef() != null)

        return new URL(
            ((url.getProtocol() != null) ? url.getProtocol().toLowerCase() : "") +
            "://" +
            ((url.getHost() != null) ? url.getHost().toLowerCase() : "") +
            ((url.getFile() != null) ? url.getFile() : "")
        );

    else return url;
}
catch (MalformedURLException e)
    { return null; }
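- Example (Hypothetical Usage Sketch):
A minimal sketch, assuming the 'URLs' class is imported from the Torello.Java.Additional package; the sample URL is one of those shown in the javaURLHelpMessage table above.

import Torello.Java.Additional.URLs;
import java.net.URL;

public class ShortenDemo
{
    public static void main(String[] argv) throws Exception
    {
        URL u = new URL("https://DallasCityHall.com/news/ARTICLE-1.html#subpart1");

        // Per the description above, the '#subpart1' fragment should be gone, and the
        // protocol & host should come back lower-cased
        System.out.println(URLs.shortenPoundREF(u));
    }
}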
-
shortenPoundREFs
public static int shortenPoundREFs(java.util.Vector<java.net.URL> urls, boolean ifExceptionSetNull)
This may seem like a bad thing to do - it removes all "#Name-Anchor" Fragment-Elements from all URL's in a list. Generally, a programmer would probably think such links useful.
When performing a News-or-Content Web-Site Scrape, Page-Fragment Links are more easily handled by removing the Hash-Tag '#' (and everything after it) first. After removing the fragment part of the URL, then and only then should the download begin.
In case Named-Anchors isn't a familiar term, they are URL's that have a Hash-Tag '#' followed by a simple-name placed after the file & directory part of the URL. They usually look like: <A HREF='SomeFile.html#SomeFragmentName'>
Named-Anchor Fragments & Duplicates:
It is important to realize that when scanning for duplicate URL's from a list of URL's, different Web-Page Fragments of the exact same Web-Page are still duplicates. Eliminating the Named-Anchor part of a URL, and then scanning the list of URL's afterwards, makes finding duplicate downloads a lot easier.
An Anchor-Name URL will cause a download of the exact-same content either way. The hash-tag ('#') really only affects how a browser renders the page you look at; it does not affect what content is downloaded from the Web-Server at all.
Exception Suppression:
If, in the process of removing the URL-Fragment, a MalformedURLException is thrown, it will be caught and suppressed, not thrown. What is done afterwards is configurable based on the input-parameters to this method.
This can be useful when working with large numbers of URL's where only a few of them cause problems while resolving them.
- Parameters:
urls - Any list of completed (read: fully-resolved) URL's.
ifExceptionSetNull - If this parameter is passed TRUE, and there is ever an exception-throw while building the new URL's (without the fragment / pound-sign), then that position in the Vector will be replaced with a null.
When this parameter is passed FALSE, if an exception is thrown, then it will be caught and silently ignored.
- Returns: The number / count of URL's in this list that were modified. Whenever a URL Named-Anchor is encountered, it will be removed from the URL, and a new URL without the fragment-part will be inserted to replace the old one.
The integer that is returned here is the number of times that a replacement was made to the input Vector-parameter 'urls'.
- Code:
- Exact Method Body:
int pos          = 0;
int shortenCount = 0;

for (int i = (urls.size() - 1); i >= 0; i--)
{
    URL url = urls.elementAt(i);

    try
    {
        if (url.getRef() != null)
        {
            URL newURL = new URL(
                ((url.getProtocol() != null) ? url.getProtocol().toLowerCase() : "") +
                "://" +
                ((url.getHost() != null) ? url.getHost().toLowerCase() : "") +
                ((url.getFile() != null) ? url.getFile() : "")
            );

            urls.setElementAt(newURL, i);
            shortenCount++;
        }
    }
    catch (MalformedURLException e)
        { if (ifExceptionSetNull) urls.setElementAt(null, i); }
}

return shortenCount;
-
shortenPoundREFs_KE
public static Ret2<java.lang.Integer,java.util.Vector<java.net.MalformedURLException>> shortenPoundREFs_KE (java.util.Vector<java.net.URL> urls, boolean ifExceptionSetNull)
This may seem like a bad thing to do - it removes all "#Name-Anchor" Fragment-Elements from all URL's in a list. Generally, a programmer would probably think such links useful.
When performing a News-or-Content Web-Site Scrape, Page-Fragment Links are more easily handled by removing the Hash-Tag '#' (and everything after it) first. After removing the fragment part of the URL, then and only then should the download begin.
In case Named-Anchors isn't a familiar term, they are URL's that have a Hash-Tag '#' followed by a simple-name placed after the file & directory part of the URL. They usually look like: <A HREF='SomeFile.html#SomeFragmentName'>
Named-Anchor Fragments & Duplicates:
It is important to realize that when scanning for duplicate URL's from a list of URL's, different Web-Page Fragments of the exact same Web-Page are still duplicates. Eliminating the Named-Anchor part of a URL, and then scanning the list of URL's afterwards, makes finding duplicate downloads a lot easier.
An Anchor-Name URL will cause a download of the exact-same content either way. The hash-tag ('#') really only affects how a browser renders the page you look at; it does not affect what content is downloaded from the Web-Server at all.
Exception Suppression:
If, in the process of removing the URL-Fragment, a MalformedURLException is thrown, it will be caught and suppressed, not thrown. What is done afterwards is configurable based on the input-parameters to this method.
This can be useful when working with large numbers of URL's where only a few of them cause problems while resolving them.
KE: Keep Exceptions
This method is identical to the previous method, defined above, except that it allows a programmer to keep / retain any MalformedURLException's that are thrown while re-building the URL's.
- Parameters:
urls - Any list of completed (read: fully-resolved) URL's.
ifExceptionSetNull - If this is TRUE, then if there is ever an exception building a new URL without a "Relative URL '#'" (Pound-Sign), then that position in the Vector will be replaced with 'null.'
- Returns: The number / count of URL's in this list that were modified. If a URL was modified, it was because it had a partial-page reference in it. If, in the process of generating a new URL out of an old one, a MalformedURLException occurs, the exception will be placed in the Ret2.b position, which is a Vector<MalformedURLException>.
SPECIFICALLY:
- Ret2.a = 'Integer' number of URL's shortened for having a '#' partial-reference.
- Ret2.b = Vector<MalformedURLException> where each element of this Vector is null if there were no problems converting the URL, or the exception reference if there were exceptions thrown.
- Code:
- Exact Method Body:
int pos          = 0;
int shortenCount = 0;

Vector<MalformedURLException> v = new Vector<>();

for (int i=0; i < urls.size(); i++) v.setElementAt(null, i);

for (int i = (urls.size() - 1); i >= 0; i--)
{
    URL url = urls.elementAt(i);

    try
    {
        if (url.getRef() != null)
        {
            URL newURL = new URL(
                ((url.getProtocol() != null) ? url.getProtocol().toLowerCase() : "") +
                "://" +
                ((url.getHost() != null) ? url.getHost().toLowerCase() : "") +
                ((url.getFile() != null) ? url.getFile() : "")
            );

            urls.setElementAt(newURL, i);
            shortenCount++;
        }
    }
    catch (MalformedURLException e)
    {
        if (ifExceptionSetNull) urls.setElementAt(null, i);
        v.setElementAt(e, i);
    }
}

return new Ret2<Integer, Vector<MalformedURLException>>(Integer.valueOf(shortenCount), v);
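- Example (Hypothetical Usage Sketch):
A minimal sketch, assuming 'URLs' and 'Ret2' both live in the Torello.Java.Additional package, and that Ret2 exposes its two results through the fields 'a' and 'b' exactly as described in the Returns section above; the sample URL's are illustrative only.

import Torello.Java.Additional.Ret2;
import Torello.Java.Additional.URLs;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Vector;

public class ShortenKEDemo
{
    public static void main(String[] argv) throws Exception
    {
        Vector<URL> urls = new Vector<>();
        urls.add(new URL("https://dallascityhall.com/news/ARTICLE-1.html#subpart1"));
        urls.add(new URL("https://dallascityhall.com/news/"));

        Ret2<Integer, Vector<MalformedURLException>> ret =
            URLs.shortenPoundREFs_KE(urls, true);

        // Ret2.a: how many URL's were shortened; Ret2.b: any exceptions, position-by-position
        System.out.println("shortened: " + ret.a);
        System.out.println("exceptions: " + ret.b);
    }
}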
-
urlToString
public static java.lang.String urlToString(java.net.URL url)
On the internet, a URL is part case-sensitive, and part case-insensitive. The Domain-Name and Protocol ('http://' and 'some.company.com') portions of the URL are Case-Insensitive - they may be in any combination of upper or lower case.
However, the directory, file-name, and (optional) Query-String portions of a URL are (often, but not always) Case-Sensitive. The sensitivity to case in these three parts of a URL is dependent upon the individual Web-Server that is providing the content for the URL.
To summarize: DNS servers, which monitor the Domain-Name part of a URL, treat upper & lower case English-Letters as the same. Web-Servers that utilize the File-Directory part of a URL will sometimes care about case, and sometimes won't. This behavior is dependent upon how the Web-Master has configured his system.
- Parameters:
url - This may be any Internet-Domain URL
- Returns: A String version of this URL, but the domain and protocol portions of the URL will be a "consistent" lower case. The case of the directory, file and (possibly, but not guaranteed to be present) query-string portions will not be modified either way.
NOTE: This type of information is pretty important if you are attempting to scan for duplicate URL's or check their equality.
- Code:
- Exact Method Body:
return
    ((url.getProtocol() != null) ? url.getProtocol().toLowerCase() : "") +
    "://" +
    ((url.getHost() != null) ? url.getHost().toLowerCase() : "") +
    ((url.getPath() != null) ? url.getPath() : "") +
    ((url.getQuery() != null) ? ('?' + url.getQuery()) : "") +
    ((url.getRef() != null) ? ('#' + url.getRef()) : "");
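- Example (Hypothetical Usage Sketch):
A minimal sketch, assuming the 'URLs' class is imported from the Torello.Java.Additional package; the sample URL and the expected output come from the javaURLHelpMessage table above.

import Torello.Java.Additional.URLs;
import java.net.URL;

public class UrlToStringDemo
{
    public static void main(String[] argv) throws Exception
    {
        URL u = new URL("https://DallasCityHall.com/NEWS/article1.html?q=somevalue");

        // Per the table printed by javaURLHelpMessage(StorageWriter), this should print:
        // https://dallascityhall.com/NEWS/article1.html?q=somevalue
        System.out.println(URLs.urlToString(u));
    }
}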
-
CURL
public static void CURL(java.net.URL url, java.lang.String outFileName, java.lang.String userAgent) throws java.io.IOException
As of today, the version of the UNIX 'curl' command does not seem to be downloading everything properly. It downloaded an image '.png' file just fine, but seemed to have botched a zip-file. This does what the UNIX 'curl' command does, but does not actually invoke the UNIX operating system to do it. It just does this...
- Parameters:
url - This may be any URL, but it is intended to be a downloadable file. It will download '.html' files fine, but you may try images, data-files, zip-files, tar-archives, and movies.
outFileName - You must specify a file-name, and if this parameter is null, a NullPointerException will be thrown immediately. If you would like your program to guess the filename - based on the file named in the URL - please use the method URL.getFile(), or something to that effect.
userAgent - A User-Agent, as a String. If this parameter is passed null, it will be silently ignored, and a User-Agent won't be used.
- Throws:
java.io.IOException - If there are I/O Errors when using the HttpURLConnection.
- Code:
- Exact Method Body:
HttpURLConnection con = (HttpURLConnection) url.openConnection();

con.setRequestMethod("GET");

if (userAgent != null) con.setRequestProperty("User-Agent", userAgent);

InputStream      is  = con.getInputStream();
FileOutputStream fos = new FileOutputStream(outFileName);

byte[] b      = new byte[5000];
int    result = 0;

while ((result = is.read(b)) != -1) fos.write(b, 0, result);

fos.flush();
fos.close();
is.close();
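- Example (Hypothetical Usage Sketch):
A minimal sketch, assuming the 'URLs' class is imported from the Torello.Java.Additional package; the URL, the output file-name, and the User-Agent String are all illustrative only.

import Torello.Java.Additional.URLs;
import java.net.URL;

public class CurlDemo
{
    public static void main(String[] argv) throws Exception
    {
        // Both the URL and the output file-name here are illustrative
        URL url = new URL("https://dallascityhall.com/news/ARTICLE-1.html");

        // A null user-agent is also permitted; the header is simply omitted in that case
        URLs.CURL(url, "ARTICLE-1.html", "Mozilla/5.0 (compatible; Java-HTML)");
    }
}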
-
-