Package Torello.HTML
Class Scrape
- java.lang.Object
  - Torello.HTML.Scrape
public class Scrape extends java.lang.Object
Some standard utilities for transferring & downloading HTML from web-sites, and then storing that content in memory as a Java String - which, subsequently, can be written to disk, transferred elsewhere, or even parsed (using class HTMLPage). This class just simplifies some of the typing for common Java Network Connection / HTTP Connection code.
The openConn(args) methods open different types of connections to web-servers.
Connection Types:
It is important to note the major differences between web-connections. If a user is receiving simple ASCII, these connections will leave out 100% of the "higher order" UTF-8 characters (many of which are foreign-language characters). Often-times a usual Java web-connection method will suffice, but not always. If a website does not use any characters that range above ASCII 255, then the usual BufferedReader connection is just fine. UTF-8, however, is a very commonly used internet character-set. It includes everything from Spanish accent characters to many Chinese Mandarin characters, and tens of thousands of other characters, all of which are listed in the UTF-8 specification.
The iso_8859_1 version I was forced to use once for a site from Spain involving the famous book by Cervantes, although I'm not completely certain how this standard works - and I have only needed this connection type twice. UTF-8, on the other hand, is used on 70% of the websites that I have parsed.
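Example (Usage Sketch):
A minimal sketch of the intended workflow, not part of the original Javadoc. The URL is a placeholder chosen only for illustration; the method invoked is documented further down this page.

import Torello.HTML.Scrape;

public class ScrapeExample
{
    public static void main(String[] args) throws java.io.IOException
    {
        // Placeholder URL, for illustration only
        String html = Scrape.scrapePage("https://example.com/");

        // The returned String may now be written to disk,
        // transferred elsewhere, or parsed.
        System.out.println("Downloaded " + html.length() + " characters of HTML.");
    }
}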
Hi-Lited Source-Code:
- View Here: Torello/HTML/Scrape.java
File Size: 32,962 Bytes, Line Count: 775 '\n' Characters Found
Stateless Class: This class neither contains any program-state, nor can it be instantiated. The @StaticFunctional Annotation may also be called 'The Spaghetti Report'. Static-Functional classes are, essentially, C-Styled Files, without any constructors or non-static member fields. It is a concept very similar to the Java-Bean's @Stateless Annotation.
- 1 Constructor(s), 1 declared private, zero-argument constructor
- 21 Method(s), 21 declared static
- 2 Field(s), 2 declared static, 0 declared final
- Fields excused from final modifier (with explanation):
  Field 'USER_AGENT' is not final. Reason: CONFIGURATION
  Field 'USE_USER_AGENT' is not final. Reason: FLAG
-
Field Summary
- static boolean USE_USER_AGENT
- static String USER_AGENT
-
Method Summary

Open HTTP Connection, Get Reader:
- static BufferedReader openConn(String url)
- static BufferedReader openConn(URL url)
- static BufferedReader openConn_iso_8859_1(String url)
- static BufferedReader openConn_iso_8859_1(URL url)
- static BufferedReader openConn_UTF8(String url)
- static BufferedReader openConn_UTF8(URL url)

Open HTTP Connection, Get Reader & Headers:
- static Ret2<BufferedReader, Map<String,List<String>>> openConnGetHeader(URL url)
- static Ret2<BufferedReader, Map<String,List<String>>> openConnGetHeader_iso_8859_1(URL url)
- static Ret2<BufferedReader, Map<String,List<String>>> openConnGetHeader_UTF8(URL url)

Read / Scrape Contents to String:
- static String scrapePage(BufferedReader br)
- static String scrapePage(String url)
- static String scrapePage(URL url)

Read / Scrape Contents to Vector<String>:
- static Vector<String> scrapePageToVector(BufferedReader br, boolean includeNewLine)
- static Vector<String> scrapePageToVector(String url, boolean includeNewLine)
- static Vector<String> scrapePageToVector(URL url, boolean includeNewLine)

Read / Scrape Contents to StringBuffer, Range-Limited:
- static StringBuffer getHTML(BufferedReader br, int startLineNum, int endLineNum)
- static StringBuffer getHTML(BufferedReader br, String startTag, String endTag)

HTTP Header Methods:
- static InputStream checkHTTPCompression(Map<String,List<String>> httpHeaders, InputStream is)
- static String httpHeadersToString(Map<String,List<String>> httpHeaders)
- static boolean usesDeflate(Map<String,List<String>> httpHeaders)
- static boolean usesGZIP(Map<String,List<String>> httpHeaders)
-
Field Detail
-
USER_AGENT
public static java.lang.String USER_AGENT
When opening an HTTP URL connection, it is usually a good idea to use a "User Agent". The default behavior in this Scrape & Search Package is to connect using the field: public static String USER_AGENT = "Chrome/61.0.3163.100";
NOTE: This behavior may be changed by modifying these public static variables.
ALSO: If the boolean USE_USER_AGENT is set to FALSE, then no User-Agent will be used at all.
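Example (Configuration Sketch):
A short sketch of how these two public static fields might be adjusted before opening any connections. The user-agent value shown is only an illustrative stand-in, not a recommendation.

import Torello.HTML.Scrape;

public class UserAgentConfig
{
    public static void main(String[] args)
    {
        // Swap in a different browser-identification string
        // (illustrative value only)
        Scrape.USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64)";

        // Or disable the "User-Agent" request-property entirely
        Scrape.USE_USER_AGENT = false;
    }
}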
-
USE_USER_AGENT
public static boolean USE_USER_AGENT
When opening an HTTP URL connection, it is usually a good idea to use a "User Agent". The default behavior in this Scrape & Search Package is to connect using the field: public static String USER_AGENT = "Chrome/61.0.3163.100";
NOTE: This behavior may be changed by modifying these public static variables.
ALSO: If this boolean is set to FALSE, then no User-Agent will be used at all.
-
Method Detail
-
usesGZIP
public static boolean usesGZIP (java.util.Map<java.lang.String,java.util.List<java.lang.String>> httpHeaders)
This method will check whether the HTTP Header returned by a website has been encoded using the GZIP Compression encoding. It expects the java.util.Map that is returned from an invocation of HttpURLConnection.getHeaderFields().
Case-Insensitive:
Since HTTP Headers are considered CASE INSENSITIVE, all String comparisons done in this method shall ignore case.
- Parameters:
httpHeaders - This is simply a java.util.Map<String, List<String>>. It must be the exact map that is returned by the HttpURLConnection.
- Returns: If this map contains a property named "Content-Encoding" AND this property has a property-value in its list equal to "gzip", then this method will return TRUE. Otherwise this method will return FALSE.
- Code:
- Exact Method Body:
// NOTE: HTTP Headers are CASE-INSENSITIVE, so a loop is needed to check if
// certain values are present - rather than the (more simple) Map.containsKey(...)
for (String prop : httpHeaders.keySet())

    // Check (Case Insensitive) if the HTTP Headers Map has the property "Content-Encoding"
    // NOTE: The Maps returned have been known to contain null keys, so check for that here.
    if ((prop != null) && prop.equalsIgnoreCase("Content-Encoding"))

        // Check (Case Insensitive) if any of the properties assigned to "Content-Encoding"
        // is "GZIP".  If this is found, return TRUE immediately.
        for (String vals : httpHeaders.get(prop))
            if (vals.equalsIgnoreCase("gzip")) return true;

// The property-value "GZIP" wasn't found, so return FALSE.
return false;
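Example (Usage Sketch):
A sketch that feeds a live header-Map to both usesGZIP and usesDeflate (documented below). The URL is a placeholder; the connection calls are standard java.net API.

import Torello.HTML.Scrape;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
import java.util.Map;

public class EncodingCheck
{
    public static void main(String[] args) throws java.io.IOException
    {
        // Placeholder URL, for illustration only
        URL url = new URL("https://example.com/");

        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("GET");

        // The exact Map required by usesGZIP / usesDeflate
        Map<String, List<String>> headers = con.getHeaderFields();

        System.out.println("gzip:    " + Scrape.usesGZIP(headers));
        System.out.println("deflate: " + Scrape.usesDeflate(headers));
    }
}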
-
usesDeflate
public static boolean usesDeflate (java.util.Map<java.lang.String,java.util.List<java.lang.String>> httpHeaders)
This method will check whether the HTTP Header returned by a website has been encoded using the ZIP Compression (PKZIP, Deflate) encoding. It expects the java.util.Map that is returned from an invocation of HttpURLConnection.getHeaderFields().
- Parameters:
httpHeaders - This is simply a java.util.Map<String, List<String>>. It must be the exact map that is returned by the HttpURLConnection.
- Returns: If this map contains a property named "Content-Encoding" AND this property has a property-value in its list equal to "deflate", then this method will return TRUE. Otherwise this method will return FALSE.
NOTE: Since HTTP Headers are considered CASE INSENSITIVE, all String comparisons done in this method shall ignore case.
- Code:
- Exact Method Body:
// NOTE: HTTP Headers are CASE-INSENSITIVE, so a loop is needed to check if
// certain values are present - rather than the (more simple) Map.containsKey(...)
for (String prop : httpHeaders.keySet())

    // Check (Case Insensitive) if the HTTP Headers Map has the property "Content-Encoding"
    // NOTE: The returned Maps have been known to contain null keys, so check for that here
    if ((prop != null) && prop.equalsIgnoreCase("Content-Encoding"))

        // Check (Case Insensitive) if any properties assigned to "Content-Encoding" are
        // "DEFLATE" - then return TRUE immediately.
        for (String vals : httpHeaders.get(prop))
            if (vals.equalsIgnoreCase("deflate")) return true;

// The property-value "deflate" wasn't found, so return FALSE.
return false;
-
checkHTTPCompression
public static java.io.InputStream checkHTTPCompression (java.util.Map<java.lang.String,java.util.List<java.lang.String>> httpHeaders, java.io.InputStream is) throws java.io.IOException
This method will check whether the HTTP Header returned by a website has been encoded using compression. It expects the java.util.Map that is returned from an invocation of HttpURLConnection.getHeaderFields().
- Parameters:
httpHeaders - This is simply a java.util.Map<String, List<String>>. It must be the exact map that is returned by the HttpURLConnection.
is - This should be the InputStream that is returned from the HttpURLConnection when requesting the content from the web-server that is hosting the URL. The HTTP Headers will be searched, and if a compression algorithm has been specified (and the algorithm is one of the algorithms automatically handled by Java), then this InputStream shall be wrapped by the appropriate decompression algorithm.
- Returns: If this map contains a property named "Content-Encoding" AND this property has a property-value in its list equal to either "deflate" or "gzip", then this shall return a wrapped InputStream that is capable of handling the decompression algorithm. Otherwise, the original InputStream is returned unchanged.
NOTE: Since HTTP Headers are considered CASE INSENSITIVE, all String comparisons done in this method shall ignore case.
- Throws:
java.io.IOException
- Code:
- Exact Method Body:
// NOTE: HTTP Headers are CASE-INSENSITIVE, so a loop is needed to check if
// certain values are present - rather than the (more simple) Map.containsKey(...)
for (String prop : httpHeaders.keySet())

    // Check (Case Insensitive) if the HTTP Headers Map has the property "Content-Encoding"
    // NOTE: The returned Maps have been known to contain null keys, so check for that here
    if ((prop != null) && prop.equalsIgnoreCase("Content-Encoding"))

        // Check (Case Insensitive) if any properties assigned to "Content-Encoding"
        // are "DEFLATE" or "GZIP" - then return the wrapped stream immediately.
        for (String vals : httpHeaders.get(prop))

            if (vals.equalsIgnoreCase("gzip"))         return new GZIPInputStream(is);
            else if (vals.equalsIgnoreCase("deflate")) return new ZipInputStream(is);

// Neither of the property-values "gzip" or "deflate" were found.
// Return the original input stream.
return is;
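Example (Usage Sketch):
A sketch of the decompression hand-off, wired into a raw HttpURLConnection. The URL is a placeholder; everything else is either standard java.net / java.io API or the method documented directly above.

import Torello.HTML.Scrape;
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class CompressionExample
{
    public static void main(String[] args) throws java.io.IOException
    {
        // Placeholder URL, for illustration only
        HttpURLConnection con = (HttpURLConnection)
            new URL("https://example.com/").openConnection();

        con.setRequestMethod("GET");

        // If "Content-Encoding: gzip" (or "deflate") came back, the stream is
        // wrapped in a decompressing stream; otherwise it is returned untouched.
        InputStream is = Scrape.checkHTTPCompression
            (con.getHeaderFields(), con.getInputStream());

        try (BufferedReader br = new BufferedReader(new InputStreamReader(is)))
            { System.out.println(br.readLine()); }
    }
}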
-
httpHeadersToString
public static java.lang.String httpHeadersToString (java.util.Map<java.lang.String,java.util.List<java.lang.String>> httpHeaders)
This method shall simply take as input a java.util.Map which contains the HTTP Header properties that must have been generated by a call to the method HttpURLConnection.getHeaderFields(). It will produce a Java String that lists these headers in a readable text format.
- Parameters:
httpHeaders - This parameter must be an instance of java.util.Map<String, List<String>>, and it should have been generated by a call to HttpURLConnection.getHeaderFields(). The property names and values contained by this Map will be iterated and printed to a returned java.lang.String.
- Returns: This shall return a printed version of the Map.
- Code:
- Exact Method Body:
StringBuilder sb = new StringBuilder();
int max = 0;

// To ensure that the output string is "aligned", check the length of each of the
// keys in the HTTP Header.
for (String key : httpHeaders.keySet())
    if (key.length() > max) max = key.length();

max += 5;

// Iterate all of the Properties that are included in the 'httpHeaders' parameter.
// It is important to note that the java "toString()" method for the List<String> that
// is used to store the Property-Values list works great, without any changes.
for (String key : httpHeaders.keySet())
    sb.append(
        StringParse.rightSpacePad(key + ':', max) +
        httpHeaders.get(key).toString() + '\n'
    );

return sb.toString();
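Example (Usage Sketch):
A sketch that pretty-prints the response-headers of one request. The URL is a placeholder.

import Torello.HTML.Scrape;
import java.net.HttpURLConnection;
import java.net.URL;

public class HeaderDump
{
    public static void main(String[] args) throws java.io.IOException
    {
        // Placeholder URL, for illustration only
        HttpURLConnection con = (HttpURLConnection)
            new URL("https://example.com/").openConnection();

        // One aligned line per HTTP Header property
        System.out.println(Scrape.httpHeadersToString(con.getHeaderFields()));
    }
}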
-
openConn
public static java.io.BufferedReader openConn(java.lang.String url) throws java.io.IOException
- Code:
- Exact Method Body:
return openConn(new URL(url));
-
openConn
public static java.io.BufferedReader openConn(java.net.URL url) throws java.io.IOException
Opens a standard connection to a URL, and returns a BufferedReader for reading from it.
GZIP Compression:
It is uncommon to find compressed content on many of the major web-sites or web-hubs that are prevalent on the internet today. There are some where one may find a zipped page; for instance, Amazon occasionally returns zipped pages. This method will check whether one of the ZIP formats ('gzip' and 'deflate') has been specified in the HTTP Header that is returned. If it has, the '.html' file received will be decompressed first.
It should be of note that 'gzip' and 'deflate' are not the only compression algorithms that may be specified in an HTTP Header - there are others (tar-gzip, pack200, zStandard, etc...). If another compression algorithm is in use by a web-server for a specific URL, then manually selecting a decompression algorithm would be necessary.
Employing a User-Agent:
The inclusion of the "User Agent" field in this URL connection can be controlled from two public static fields at the top of this class. Being able to identify how one web-server will respond to a different "Browser User Agent" is well beyond the scope of these documentation notes, and is the subject of the "browser wars." Using them is not mandatory, and "which browser is being used" is all the 'USER_AGENT' field of an 'HttpURLConnection' even signifies.
- Parameters:
url - This may be an Internet URL.
- Returns: A java BufferedReader for retrieving the data from the internet connection.
- Throws:
java.io.IOException
- See Also: USER_AGENT, USE_USER_AGENT, checkHTTPCompression(Map, InputStream)
- Code:
- Exact Method Body:
HttpURLConnection con = (HttpURLConnection) url.openConnection();
con.setRequestMethod("GET");

if (USE_USER_AGENT) con.setRequestProperty("User-Agent", USER_AGENT);

InputStream is = checkHTTPCompression(con.getHeaderFields(), con.getInputStream());

return new BufferedReader(new InputStreamReader(is));
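Example (Usage Sketch):
A sketch that opens a standard connection and prints the page line-by-line. The URL is a placeholder.

import Torello.HTML.Scrape;
import java.io.BufferedReader;
import java.net.URL;

public class OpenConnExample
{
    public static void main(String[] args) throws java.io.IOException
    {
        // Placeholder URL, for illustration only
        BufferedReader br = Scrape.openConn(new URL("https://example.com/"));

        String line;
        while ((line = br.readLine()) != null) System.out.println(line);

        br.close();
    }
}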
-
openConnGetHeader
public static Ret2<java.io.BufferedReader,java.util.Map<java.lang.String,java.util.List<java.lang.String>>> openConnGetHeader (java.net.URL url) throws java.io.IOException
Opens a standard connection to a URL, and returns a BufferedReader for reading it, and also the HTTP Header that was returned by the HTTP Server.
GZIP Compression:
It is uncommon to find compressed content on many of the major web-sites or web-hubs that are prevalent on the internet today. There are some where one may find a zipped page; for instance, Amazon occasionally returns zipped pages. This method will check whether one of the ZIP formats ('gzip' and 'deflate') has been specified in the HTTP Header that is returned. If it has, the '.html' file received will be decompressed first.
It should be of note that 'gzip' and 'deflate' are not the only compression algorithms that may be specified in an HTTP Header - there are others (tar-gzip, pack200, zStandard, etc...). If another compression algorithm is in use by a web-server for a specific URL, then manually selecting a decompression algorithm would be necessary.
- Parameters:
url - This may be an Internet URL.
- Returns: This shall return an instance of class Ret2. The contents of the multiple return type are as follows:
Ret2.a (BufferedReader) - A BufferedReader that shall retrieve the HTTP Response from the URL provided to this method.
Ret2.b (java.util.Map) - An instance of Map<String, List<String>> which will contain the HTTP Headers that are returned by the HTTP Server associated with the URL provided to this method.
NOTE: This HTTP Header is obtained from the Java method HttpURLConnection.getHeaderFields().
- Throws:
java.io.IOException
- See Also: checkHTTPCompression(Map, InputStream)
- Code:
- Exact Method Body:
HttpURLConnection con = (HttpURLConnection) url.openConnection();
con.setRequestMethod("GET");

if (USE_USER_AGENT) con.setRequestProperty("User-Agent", USER_AGENT);

Map<String, List<String>> httpHeaders = con.getHeaderFields();
InputStream is = checkHTTPCompression(httpHeaders, con.getInputStream());

return new Ret2<BufferedReader, Map<String, List<String>>>
    (new BufferedReader(new InputStreamReader(is)), httpHeaders);
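Example (Usage Sketch):
A sketch that unpacks the Ret2 returned by this method. The URL is a placeholder, and the import location of class Ret2 (package Torello.Java) is an assumption about this library's layout; the fields Ret2.a and Ret2.b are used exactly as documented above.

import Torello.HTML.Scrape;
import Torello.Java.Ret2;   // Assumed package location for class Ret2

import java.io.BufferedReader;
import java.net.URL;
import java.util.List;
import java.util.Map;

public class GetHeaderExample
{
    public static void main(String[] args) throws java.io.IOException
    {
        // Placeholder URL, for illustration only
        Ret2<BufferedReader, Map<String, List<String>>> ret =
            Scrape.openConnGetHeader(new URL("https://example.com/"));

        // Ret2.b holds the HTTP Response-Headers Map
        System.out.println(Scrape.httpHeadersToString(ret.b));

        // Ret2.a holds the BufferedReader for the page content
        System.out.println(Scrape.scrapePage(ret.a));
    }
}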
-
openConn_iso_8859_1
public static java.io.BufferedReader openConn_iso_8859_1 (java.lang.String url) throws java.io.IOException
- Code:
- Exact Method Body:
return openConn_iso_8859_1(new URL(url));
-
openConn_iso_8859_1
public static java.io.BufferedReader openConn_iso_8859_1 (java.net.URL url) throws java.io.IOException
Will open an ISO-8859-1 connection to a URL, and returns a BufferedReader for reading it.
GZIP Compression:
It is uncommon to find compressed content on many of the major web-sites or web-hubs that are prevalent on the internet today. There are some where one may find a zipped page; for instance, Amazon occasionally returns zipped pages. This method will check whether one of the ZIP formats ('gzip' and 'deflate') has been specified in the HTTP Header that is returned. If it has, the '.html' file received will be decompressed first.
It should be of note that 'gzip' and 'deflate' are not the only compression algorithms that may be specified in an HTTP Header - there are others (tar-gzip, pack200, zStandard, etc...). If another compression algorithm is in use by a web-server for a specific URL, then manually selecting a decompression algorithm would be necessary.
Employing a User-Agent:
The inclusion of the "User Agent" field in this URL connection can be controlled from two public static fields at the top of this class. Being able to identify how one web-server will respond to a different "Browser User Agent" is well beyond the scope of these documentation notes, and is the subject of the "browser wars." Using them is not mandatory, and "which browser is being used" is all the 'USER_AGENT' field of an 'HttpURLConnection' even signifies.
- Parameters:
url - This may be an Internet URL. The site and page to which it points should return data encoded in the ISO-8859-1 charset.
- Returns: A java BufferedReader for retrieving the data from the internet connection.
- Throws:
java.io.IOException
- See Also: USER_AGENT, USE_USER_AGENT, checkHTTPCompression(Map, InputStream)
- Code:
- Exact Method Body:
HttpURLConnection con = (HttpURLConnection) url.openConnection();
con.setRequestMethod("GET");

if (USE_USER_AGENT) con.setRequestProperty("User-Agent", USER_AGENT);

con.setRequestProperty("Content-Type", "text/html; charset=iso-8859-1");

InputStream is = checkHTTPCompression(con.getHeaderFields(), con.getInputStream());

return new BufferedReader(new InputStreamReader(is, Charset.forName("iso-8859-1")));
-
openConnGetHeader_iso_8859_1
public static Ret2<java.io.BufferedReader,java.util.Map<java.lang.String,java.util.List<java.lang.String>>> openConnGetHeader_iso_8859_1 (java.net.URL url) throws java.io.IOException
Opens an ISO-8859-1 connection to a URL, and returns a BufferedReader for reading it, and also the HTTP Header that was returned by the HTTP Server.
GZIP Compression:
It is uncommon to find compressed content on many of the major web-sites or web-hubs that are prevalent on the internet today. There are some where one may find a zipped page; for instance, Amazon occasionally returns zipped pages. This method will check whether one of the ZIP formats ('gzip' and 'deflate') has been specified in the HTTP Header that is returned. If it has, the '.html' file received will be decompressed first.
It should be of note that 'gzip' and 'deflate' are not the only compression algorithms that may be specified in an HTTP Header - there are others (tar-gzip, pack200, zStandard, etc...). If another compression algorithm is in use by a web-server for a specific URL, then manually selecting a decompression algorithm would be necessary.
- Parameters:
url - This may be an Internet URL. The site and page to which it points should return data encoded in the ISO-8859-1 charset.
- Returns: This shall return an instance of class Ret2. The contents of the multiple return type are as follows:
Ret2.a (BufferedReader) - A BufferedReader that shall retrieve the HTTP Response from the URL provided to this method.
Ret2.b (java.util.Map) - An instance of Map<String, List<String>> which will contain the HTTP Headers that are returned by the HTTP Server associated with the URL provided to this method.
NOTE: This HTTP Header is obtained from the Java method HttpURLConnection.getHeaderFields().
- Throws:
java.io.IOException
- See Also: checkHTTPCompression(Map, InputStream)
- Code:
- Exact Method Body:
HttpURLConnection con = (HttpURLConnection) url.openConnection();
con.setRequestMethod("GET");

if (USE_USER_AGENT) con.setRequestProperty("User-Agent", USER_AGENT);

con.setRequestProperty("Content-Type", "charset=iso-8859-1");

Map<String, List<String>> httpHeaders = con.getHeaderFields();
InputStream is = checkHTTPCompression(httpHeaders, con.getInputStream());

// NOTE: The charset-name passed to Charset.forName must be "iso-8859-1" itself;
// the prefix "charset=" would not be a legal charset name.
return new Ret2<BufferedReader, Map<String, List<String>>>(
    new BufferedReader(new InputStreamReader(is, Charset.forName("iso-8859-1"))),
    httpHeaders
);
-
openConn_UTF8
public static java.io.BufferedReader openConn_UTF8(java.lang.String url) throws java.io.IOException
- Code:
- Exact Method Body:
return openConn_UTF8(new URL(url));
-
openConn_UTF8
public static java.io.BufferedReader openConn_UTF8(java.net.URL url) throws java.io.IOException
Opens a UTF8 connection to a URL, and returns a BufferedReader for reading it.
UTF-8 Character-Encoding:
For all intents and purposes, Java's internal class HttpURLConnection will handle any received UTF-8 content automatically. What this means, more or less, is that this method is largely unnecessary. It probably should be placed on the @Deprecated list; but just in case a bizarre or unforeseen situation arises where this method could be used as a reference, it shall remain here.
Please note that connecting to, or retrieving, '.html' content from a web-server that returns charset=UTF-8 is handled by the JRE with ease, since the Java primitive-type char is a 16-bit type. The methods openConn(String) and openConn(URL), etc... (without the "UTF8" appended to the method name) should suffice for making such connections. It should make no difference.
GZIP Compression:
It is uncommon to find compressed content on many of the major web-sites or web-hubs that are prevalent on the internet today. There are some where one may find a zipped page; for instance, Amazon occasionally returns zipped pages. This method will check whether one of the ZIP formats ('gzip' and 'deflate') has been specified in the HTTP Header that is returned. If it has, the '.html' file received will be decompressed first.
It should be of note that 'gzip' and 'deflate' are not the only compression algorithms that may be specified in an HTTP Header - there are others (tar-gzip, pack200, zStandard, etc...). If another compression algorithm is in use by a web-server for a specific URL, then manually selecting a decompression algorithm would be necessary.
Employing a User-Agent:
The inclusion of the "User Agent" field in this URL connection can be controlled from two public static fields at the top of this class. Being able to identify how one web-server will respond to a different "Browser User Agent" is well beyond the scope of these documentation notes, and is the subject of the "browser wars." Using them is not mandatory, and "which browser is being used" is all the 'USER_AGENT' field of an 'HttpURLConnection' even signifies.
- Parameters:
url - This may be an Internet URL. The site and page to which it points should return data encoded in the UTF-8 charset.
- Returns: A java BufferedReader for retrieving the data from the internet connection.
- Throws:
java.io.IOException
- See Also: USER_AGENT, USE_USER_AGENT, checkHTTPCompression(Map, InputStream)
- Code:
- Exact Method Body:
HttpURLConnection con = (HttpURLConnection) url.openConnection();
con.setRequestMethod("GET");

if (USE_USER_AGENT) con.setRequestProperty("User-Agent", USER_AGENT);

con.setRequestProperty("Content-Type", "charset=UTF-8");

InputStream is = checkHTTPCompression(con.getHeaderFields(), con.getInputStream());

return new BufferedReader(new InputStreamReader(is, Charset.forName("UTF-8")));
-
openConnGetHeader_UTF8
public static Ret2<java.io.BufferedReader,java.util.Map<java.lang.String,java.util.List<java.lang.String>>> openConnGetHeader_UTF8 (java.net.URL url) throws java.io.IOException
Opens a UTF8 connection to a URL, and returns a BufferedReader for reading it, and also the HTTP Header that was returned by the HTTP Server.
UTF-8 Character-Encoding:
For all intents and purposes, Java's internal class HttpURLConnection will handle any received UTF-8 content automatically. What this means, more or less, is that this method is largely unnecessary. It probably should be placed on the @Deprecated list; but just in case a bizarre or unforeseen situation arises where this method could be used as a reference, it shall remain here.
Please note that connecting to, or retrieving, '.html' content from a web-server that returns charset=UTF-8 is handled by the JRE with ease, since the Java primitive-type char is a 16-bit type. The methods openConn(String) and openConn(URL), etc... (without the "UTF8" appended to the method name) should suffice for making such connections. It should make no difference.
GZIP Compression:
It is uncommon to find compressed content on many of the major web-sites or web-hubs that are prevalent on the internet today. There are some where one may find a zipped page; for instance, Amazon occasionally returns zipped pages. This method will check whether one of the ZIP formats ('gzip' and 'deflate') has been specified in the HTTP Header that is returned. If it has, the '.html' file received will be decompressed first.
It should be of note that 'gzip' and 'deflate' are not the only compression algorithms that may be specified in an HTTP Header - there are others (tar-gzip, pack200, zStandard, etc...). If another compression algorithm is in use by a web-server for a specific URL, then manually selecting a decompression algorithm would be necessary.
- Parameters:
url - This may be an Internet URL. The site and page to which it points should return data encoded in the UTF-8 charset.
- Returns: This shall return an instance of class Ret2. The contents of the multiple return type are as follows:
Ret2.a (BufferedReader) - A BufferedReader that shall retrieve the HTTP Response from the URL provided to this method.
Ret2.b (java.util.Map) - An instance of Map<String, List<String>> which will contain the HTTP Headers that are returned by the HTTP Server associated with the URL provided to this method.
NOTE: This HTTP Header is obtained from the Java method HttpURLConnection.getHeaderFields().
- Throws:
java.io.IOException
- See Also: checkHTTPCompression(Map, InputStream)
- Code:
- Exact Method Body:
HttpURLConnection con = (HttpURLConnection) url.openConnection();
con.setRequestMethod("GET");

if (USE_USER_AGENT) con.setRequestProperty("User-Agent", USER_AGENT);

con.setRequestProperty("Content-Type", "charset=UTF-8");

Map<String, List<String>> httpHeaders = con.getHeaderFields();
InputStream is = checkHTTPCompression(httpHeaders, con.getInputStream());

return new Ret2<BufferedReader, Map<String, List<String>>>(
    new BufferedReader(new InputStreamReader(is, Charset.forName("UTF-8"))),
    httpHeaders
);
-
scrapePage
public static java.lang.String scrapePage(java.lang.String url) throws java.io.IOException
Convenience Method.
Invokes: scrapePage(BufferedReader)
Obtains: BufferedReader from openConn(String)
- Code:
- Exact Method Body:
return scrapePage(openConn(url));
-
scrapePage
public static java.lang.String scrapePage(java.net.URL url) throws java.io.IOException
- Code:
- Exact Method Body:
return scrapePage(openConn(url));
-
scrapePage
public static java.lang.String scrapePage(java.io.BufferedReader br) throws java.io.IOException
This scrapes a website and dumps the entire contents into a java.lang.String.
- Parameters:
br - This is a Reader that needs to have been connected to a website that will output text/html data.
- Returns: The text/html data, returned inside a String.
- Throws:
java.io.IOException
- Code:
- Exact Method Body:
StringBuffer sb = new StringBuffer();
String s;

while ((s = br.readLine()) != null) sb.append(s + "\n");

return sb.toString();
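Example (Usage Sketch):
A sketch that pairs this method with one of the openConn variants above. The URL is a placeholder.

import Torello.HTML.Scrape;
import java.io.BufferedReader;
import java.net.URL;

public class ScrapePageExample
{
    public static void main(String[] args) throws java.io.IOException
    {
        // Placeholder URL; a page known to be served as UTF-8 would fit here
        BufferedReader br = Scrape.openConn_UTF8(new URL("https://example.com/"));

        // Dump the complete text/html contents into a single String
        String html = Scrape.scrapePage(br);

        System.out.println("Read " + html.length() + " characters.");
    }
}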
-
scrapePageToVector
public static java.util.Vector<java.lang.String> scrapePageToVector (java.lang.String url, boolean includeNewLine) throws java.io.IOException
Convenience Method.
Invokes: scrapePageToVector(BufferedReader, boolean)
Obtains: BufferedReader from openConn(String)
- Code:
- Exact Method Body:
return scrapePageToVector(openConn(url), includeNewLine);
-
scrapePageToVector
public static java.util.Vector<java.lang.String> scrapePageToVector (java.net.URL url, boolean includeNewLine) throws java.io.IOException
Convenience Method.
Invokes: scrapePageToVector(BufferedReader, boolean)
Obtains: BufferedReader from openConn(URL)
- Code:
- Exact Method Body:
return scrapePageToVector(openConn(url), includeNewLine);
-
scrapePageToVector
public static java.util.Vector<java.lang.String> scrapePageToVector (java.io.BufferedReader br, boolean includeNewLine) throws java.io.IOException
This will scrape the entire contents of an HTML page to a Vector<String>. Each line of the text/HTML page is demarcated by the reception of a '\n' character from the web-server.
- Parameters:
br - This is the input source of the HTML page. It will be queried for String data.
includeNewLine - This will append the '\n' character to the end of each String in the Vector.
- Returns: A Vector of String's, where each String is a line on the web-page.
- Throws:
java.io.IOException
- See Also: scrapePageToVector(String, boolean)
- Code:
- Exact Method Body:
Vector<String> ret = new Vector<>();
String s = null;

if (includeNewLine)
    while ((s = br.readLine()) != null) ret.add(s + '\n');
else
    while ((s = br.readLine()) != null) ret.add(s);

return ret;
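Example (Usage Sketch):
A sketch of the line-by-line variant. The URL is a placeholder; passing 'false' leaves the '\n' characters off of each line.

import Torello.HTML.Scrape;
import java.util.Vector;

public class VectorExample
{
    public static void main(String[] args) throws java.io.IOException
    {
        // Placeholder URL, for illustration only
        Vector<String> lines =
            Scrape.scrapePageToVector("https://example.com/", false);

        System.out.println("The page contained " + lines.size() + " lines.");
    }
}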
-
getHTML
public static java.lang.StringBuffer getHTML(java.io.BufferedReader br, java.lang.String startTag, java.lang.String endTag) throws java.io.IOException
This receives an input stream that contains a pipe to a website that will produce HTML. The HTML is read from the website and returned inside a StringBuffer. This is called "scraping HTML."
- Parameters:
startTag - If this is null, the scrape will begin with the first character received. If this contains a String, the scrape will not include any text/HTML data that occurs prior to the first occurrence of 'startTag'.
endTag - If this is null, the scrape will read the entire contents of text/HTML data from the BufferedReader 'br' parameter. If this contains a String, then data will be read and included in the result until 'endTag' is received.
- Returns: A StringBuffer that is the text/html data retrieved from the Reader. Call toString() on the return value to retrieve that String.
- Throws:
ScrapeException - If, after the download completes, either 'startTag' or 'endTag' represents a String that was not found within the downloaded page, this exception is thrown.
java.io.IOException
- Code:
- Exact Method Body:
StringBuffer html = new StringBuffer();
String s;

// Nice Long Name... Guess what it means
boolean alreadyFoundEndTagInStartTagLine = false;

// If the startTag parameter is not null, skip all content, until the startTag is found!
if (startTag != null)
{
    boolean foundStartTag = false;

    while ((s = br.readLine()) != null)

        if (s.contains(startTag))
        {
            int startTagPos = s.indexOf(startTag);
            foundStartTag = true;

            // NOTE: Sometimes the 'startTag' and 'endTag' are on the same line!
            // This happens, for instance, on Yahoo Photos, when giant lines
            // (no line-breaks) are transmitted.
            // Hence... *really* long variable name, this is confusing!

            s = s.substring(startTagPos);

            if ((endTag != null) && s.contains(endTag))
            {
                s = s.substring(0, s.indexOf(endTag) + endTag.length());
                alreadyFoundEndTagInStartTagLine = true;
            }

            html.append(s + "\n");
            break;
        }

    if (! foundStartTag) throw new ScrapeException
        ("Start Tag: '" + startTag + "' was Not Found on Page.");
}

// if the endTag parameter is not null, stop reading as soon as the end-tag is found
if (endTag != null)
{
    // NOTE: This 'if' is inside curly-braces, because there is an 'else' that
    // "goes with" the 'if' above... BUT NOT the following 'if'

    if (! alreadyFoundEndTagInStartTagLine)
    {
        boolean foundEndTag = false;

        while ((s = br.readLine()) != null)

            if (s.contains(endTag))
            {
                foundEndTag = true;
                int endTagPos = s.indexOf(endTag);

                html.append(s.substring(0, endTagPos + endTag.length()) + "\n");
                break;
            }

            else html.append(s + "\n");

        if (! foundEndTag) throw new ScrapeException
            ("End Tag: '" + endTag + "' was Not Found on Page.");
    }
}

// ELSE: (endTag *was* null) ... read all content until EOF ... or ... "EOWP" (end of web-page)
else
    while ((s = br.readLine()) != null)
        html.append(s + "\n");

return html;
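Example (Usage Sketch):
A sketch that clips everything outside of the page's <body> element. The URL and the tag choices are placeholders; note that a ScrapeException will be thrown if either tag never arrives.

import Torello.HTML.Scrape;
import java.io.BufferedReader;

public class TagRangeExample
{
    public static void main(String[] args) throws java.io.IOException
    {
        // Placeholder URL, for illustration only
        BufferedReader br = Scrape.openConn("https://example.com/");

        // Keep only the content between the opening and closing <body> tags
        StringBuffer body = Scrape.getHTML(br, "<body", "</body>");

        System.out.println(body.toString());
    }
}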
-
getHTML
public static java.lang.StringBuffer getHTML(java.io.BufferedReader br, int startLineNum, int endLineNum) throws java.io.IOException
This receives an input stream that contains a pipe to a website that will produce HTML. The HTML is read from the website and returned inside a StringBuffer. This is called "scraping HTML."
- Parameters:
startLineNum - If this is '0' or '1', the scrape will begin with the first character received. If this contains a positive integer, the scrape will not include any text/HTML data that occurs prior to 'startLineNum' lines of text/html having been received.
endLineNum - If this is negative, the scrape will read the entire contents of text/HTML data from the BufferedReader 'br' parameter (until EOF is encountered). If this contains a positive integer, then data will be read and included in the result until 'endLineNum' lines of text/html have been received.
- Returns: A StringBuffer that is the text/html data retrieved from the Reader. Call toString() on the return value to retrieve that String.
- Throws:
java.lang.IllegalArgumentException - If parameter 'startLineNum' is negative, or greater than 'endLineNum'. If 'endLineNum' was negative, this test is skipped.
ScrapeException - If there were not enough lines read from the BufferedReader parameter to be consistent with the values in 'startLineNum' and 'endLineNum'.
java.io.IOException
- Code:
- Exact Method Body:
StringBuffer html = new StringBuffer();
String s = "";

// NOTE: Arrays start at 0, **BUT** HTML page line counts start at 1!
int curLineNum = 1;

if (startLineNum < 0) throw new IllegalArgumentException(
    "The parameter startLineNum is negative: " + startLineNum + " but this is not " +
    "allowed."
);

if (endLineNum == 0) throw new IllegalArgumentException
    ("The parameter endLineNum is zero, but this is not allowed.");

endLineNum   = (endLineNum < 0)    ? 1 : endLineNum;
startLineNum = (startLineNum == 0) ? 1 : startLineNum;

if ((endLineNum < startLineNum) && (endLineNum != 1)) throw new IllegalArgumentException(
    "The parameter startLineNum is: " + startLineNum + "\n" +
    "The parameter endLineNum is: " + endLineNum + "\n" +
    "It is required that the latter is larger than the former, " +
    "or it must be 0 or negative to signify read until EOF."
);

if (startLineNum > 1)
{
    while (curLineNum++ < startLineNum)

        if (br.readLine() == null) throw new ScrapeException(
            "The HTML Page that was given didn't even have enough lines to read " +
            "quantity in variable startLineNum.\nstartLineNum = " + startLineNum +
            " and read " + (curLineNum-1) + " line(s) before EOF."
        );

    // Off-By-One computer science error correction - remember, post-increment means the
    // last loop iteration didn't read a line, but did increment the loop counter!
    curLineNum--;
}

// endLineNum==1 means/implies that we don't have to heed the
// endLineNum variable ==> read to EOF/null!
if (endLineNum == 1)

    while ((s = br.readLine()) != null)
        html.append(s + "\n");

// endLineNum > 1 ==> Heed the endLineNum variable!
else
{
    for ( ; curLineNum <= endLineNum; curLineNum++)

        if ((s = br.readLine()) != null) html.append(s + "\n");
        else                             break;

    // NOTE: curLineNum-1 and endLineNum+1 are used because:
    //
    // ** The loop counter (curLineNum) breaks when the next line to read is the one
    //    past the endLineNum
    // ** endLineNum+1 is the appropriate state if enough lines were read from the
    //    HTML Page
    // ** curLineNum-1 is the number of the last line read from the HTML

    if (curLineNum != (endLineNum+1)) throw new ScrapeException(
        "The HTML Page that was read didn't have enough lines to read to quantity in " +
        "variable endLineNum.\nendLineNum = " + endLineNum + " but only read " +
        (curLineNum-1) + " line(s) before EOF."
    );
}

return html;
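Example (Usage Sketch):
A sketch of the range-limited variant. The URL and line numbers are placeholders; a ScrapeException is thrown if the page runs out of lines before line 50 arrives.

import Torello.HTML.Scrape;
import java.io.BufferedReader;

public class LineRangeExample
{
    public static void main(String[] args) throws java.io.IOException
    {
        // Placeholder URL, for illustration only
        BufferedReader br = Scrape.openConn("https://example.com/");

        // Keep only lines 1 through 50; a negative endLineNum would
        // mean "read until EOF" instead
        StringBuffer firstFifty = Scrape.getHTML(br, 1, 50);

        System.out.println(firstFifty.toString());
    }
}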
-