Package Torello.HTML
Class Scrape
- java.lang.Object
-
- Torello.HTML.Scrape
-
public class Scrape extends java.lang.Object
Some standard utilities for transferring and downloading HTML from web-sites and then storing that content in memory as a Java String, which can subsequently be written to disk, transferred elsewhere, or parsed (using class HTMLPage). This class simplifies some of the typing for common Java Network Connection / HTTP Connection code.
The openConn(args) methods open different types of connections to web-servers.
Connection Types:
It is important to note the major differences between web-connections. If a connection decodes the response as simple ASCII, it will leave out 100% of the "higher order" UTF-8 characters (many of which are foreign-language characters).
Often a usual Java web-connection method will suffice, but not always. If a website does not use any characters with codes above 255, then the usual BufferedReader connection is just fine. UTF-8, however, is a very commonly used character encoding on the internet. It covers everything from Spanish accented characters to Chinese characters and tens of thousands of other characters, all of which are defined by the Unicode standard.
The iso_8859_1 version I was forced to use once for a site from Spain involving the famous book by Cervantes, although I'm not completely certain how this standard works, and I have only needed this connection type twice. UTF-8, on the other hand, is used on 70% of the websites that I have parsed.
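The difference between the connection types comes down to which charset is used to turn the received bytes into characters. The following is a minimal sketch, not part of this class: the class name CharsetDecodeSketch and its two helper methods are hypothetical, and the sample text is an assumption chosen for illustration. It decodes the same bytes with both charsets to show what goes wrong when the wrong one is chosen.

```java
import java.nio.charset.StandardCharsets;

public class CharsetDecodeSketch {

    // Decode raw bytes as ISO-8859-1: every byte becomes exactly one character.
    public static String decodeLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Decode raw bytes as UTF-8: multi-byte sequences are combined into one character.
    public static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "a\u00F1o" is the Spanish word for "year"; \u00F1 is n-with-tilde.
        // In UTF-8 that one character occupies two bytes (0xC3 0xB1).
        byte[] raw = "a\u00F1o".getBytes(StandardCharsets.UTF_8);

        System.out.println(decodeUtf8(raw));   // the original three characters
        System.out.println(decodeLatin1(raw)); // four characters: the two bytes are split apart
    }
}
```

Decoding UTF-8 bytes as ISO-8859-1 never fails; it silently produces the wrong characters, which is why the charset has to match what the server actually sends.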
Hi-Lited Source-Code: Torello/HTML/Scrape.java
File Size: 33,015 Bytes Line Count: 773 '\n' Characters Found
Stateless Class: This class neither contains any program-state, nor can it be instantiated. The @StaticFunctional Annotation may also be called 'The Spaghetti Report'. Static-Functional classes are, essentially, C-Styled Files, without any constructors or non-static member fields. It is a concept very similar to the Java-Bean's @Stateless Annotation.
- 1 Constructor(s), 1 declared private, zero-argument constructor
- 21 Method(s), 21 declared static
- 2 Field(s), 2 declared static, 0 declared final
- Fields excused from final modifier (with explanation):
Field 'USER_AGENT' is not final. Reason: CONFIGURATION
Field 'USE_USER_AGENT' is not final. Reason: FLAG
-
-
Field Summary
Fields
Modifier and Type    Field
static boolean       USE_USER_AGENT
static String        USER_AGENT
-
Method Summary
Open HTTP Connection, Get Reader
Modifier and Type         Method
static BufferedReader     openConn(String url)
static BufferedReader     openConn(URL url)
static BufferedReader     openConn_iso_8859_1(String url)
static BufferedReader     openConn_iso_8859_1(URL url)
static BufferedReader     openConn_UTF8(String url)
static BufferedReader     openConn_UTF8(URL url)

Open HTTP Connection, Get Reader & Headers
Modifier and Type                                         Method
static Ret2<BufferedReader, Map<String, List<String>>>    openConnGetHeader(URL url)
static Ret2<BufferedReader, Map<String, List<String>>>    openConnGetHeader_iso_8859_1(URL url)
static Ret2<BufferedReader, Map<String, List<String>>>    openConnGetHeader_UTF8(URL url)

Read / Scrape Contents to String
Modifier and Type    Method
static String        scrapePage(BufferedReader br)
static String        scrapePage(String url)
static String        scrapePage(URL url)

Read / Scrape Contents to Vector<String>
Modifier and Type        Method
static Vector<String>    scrapePageToVector(BufferedReader br, boolean includeNewLine)
static Vector<String>    scrapePageToVector(String url, boolean includeNewLine)
static Vector<String>    scrapePageToVector(URL url, boolean includeNewLine)

Read / Scrape Contents to StringBuffer, Range-Limited
Modifier and Type      Method
static StringBuffer    getHTML(BufferedReader br, int startLineNum, int endLineNum)
static StringBuffer    getHTML(BufferedReader br, String startTag, String endTag)

HTTP Header Methods
Modifier and Type     Method
static InputStream    checkHTTPCompression(Map<String,List<String>> httpHeaders, InputStream is)
static String         httpHeadersToString(Map<String,List<String>> httpHeaders)
static boolean        usesDeflate(Map<String,List<String>> httpHeaders)
static boolean        usesGZIP(Map<String,List<String>> httpHeaders)
-
-
-
Field Detail
-
USER_AGENT
public static java.lang.String USER_AGENT
When opening an HTTP URL connection, it is usually a good idea to use a "User Agent". The default behavior in this Scrape & Search Package is to connect using the value of public static String USER_AGENT = "Chrome/61.0.3163.100"; This behavior may be changed by modifying these public static variables.
If the boolean USE_USER_AGENT is set to FALSE, then no User-Agent will be used at all.
- Code:
- Exact Field Declaration Expression:
public static String USER_AGENT = "Chrome/61.0.3163.100";
-
USE_USER_AGENT
public static boolean USE_USER_AGENT
When opening an HTTP URL connection, it is usually a good idea to use a "User Agent". The default behavior in this Scrape & Search Package is to connect using the value of public static String USER_AGENT = "Chrome/61.0.3163.100"; This behavior may be changed by modifying these public static variables.
If the boolean USE_USER_AGENT is set to FALSE, then no User-Agent will be used at all.
- Code:
- Exact Field Declaration Expression:
public static boolean USE_USER_AGENT = true;
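As a rough illustration of how two fields like these can steer a connection, the sketch below prepares (but does not open) an HttpURLConnection and conditionally sets the "User-Agent" request property. The class UserAgentSketch and its method prepare are hypothetical stand-ins, not part of this package; the two static fields simply copy the declarations documented above, and the URL in main is a placeholder.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

public class UserAgentSketch {

    // Stand-ins for the two documented fields (same names and default values).
    public static String  USER_AGENT     = "Chrome/61.0.3163.100";
    public static boolean USE_USER_AGENT = true;

    // Builds a GET request for the given address. Nothing is sent over the
    // network here; the connection is only configured.
    public static HttpURLConnection prepare(String url) {
        try {
            HttpURLConnection con =
                (HttpURLConnection) URI.create(url).toURL().openConnection();

            con.setRequestMethod("GET");

            // The flag decides whether any User-Agent is supplied at all.
            if (USE_USER_AGENT) con.setRequestProperty("User-Agent", USER_AGENT);

            return con;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpURLConnection con = prepare("http://example.com/");
        System.out.println("User-Agent to be sent: " + con.getRequestProperty("User-Agent"));
    }
}
```

Because the fields are static and not final, changing them affects every connection opened afterwards, which is worth keeping in mind in multi-threaded code.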
-
-
Method Detail
-
usesGZIP
public static boolean usesGZIP (java.util.Map<java.lang.String,java.util.List<java.lang.String>> httpHeaders)
This method will check whether the HTTP Header returned by a website indicates that the content has been encoded using the GZIP Compression encoding. It expects the java.util.Map that is returned from an invocation of HttpURLConnection.getHeaderFields().
Case-Insensitive:
Since HTTP Headers are considered CASE INSENSITIVE, all String comparisons done in this method shall ignore case.
- Parameters:
httpHeaders - This is simply a java.util.Map<String, List<String>>. It must be the exact map that is returned by the HttpURLConnection.
- Returns:
- If this map contains a property named "Content-Encoding" AND this property has a property-value in its list equal to "gzip", then this method will return TRUE. Otherwise this method will return FALSE.
- Code:
- Exact Method Body:
// NOTE: HTTP Headers are CASE-INSENSITIVE, so a loop is needed to check if
//       certain values are present - rather than the (more simple) Map.containsKey(...)

for (String prop : httpHeaders.keySet())

    // Check (Case Insensitive) if the HTTP Headers Map has the property "Content-Encoding"
    // NOTE: The Map's returned have been known to contain null keys, so check for that here.

    if ((prop != null) && prop.equalsIgnoreCase("Content-Encoding"))

        // Check (Case Insensitive), if any of the properties assigned to "Content-Encoding"
        // is "GZIP".  If this is found, return TRUE immediately.

        for (String vals : httpHeaders.get(prop))
            if (vals.equalsIgnoreCase("gzip")) return true;

// The property-value "GZIP" wasn't found, so return FALSE.
return false;
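To see the case-insensitive lookup in isolation, here is a small stand-alone sketch that mirrors the loop above against a hand-built map. The class ContentEncodingSketch and the method hasEncoding are hypothetical names introduced for this example; the map contents are made up, including the null key that getHeaderFields() is known to use for the status line.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ContentEncodingSketch {

    // Returns true if any "Content-Encoding" header (name compared without
    // regard to case) lists the requested encoding (also compared ignoring case).
    public static boolean hasEncoding(Map<String, List<String>> headers, String encoding) {
        for (Map.Entry<String, List<String>> e : headers.entrySet()) {
            String name = e.getKey();

            // Skip the null key and any unrelated header.
            if (name == null || !name.equalsIgnoreCase("Content-Encoding")) continue;

            for (String value : e.getValue())
                if (value.equalsIgnoreCase(encoding)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<String>> headers = new HashMap<>();
        headers.put(null, List.of("HTTP/1.1 200 OK"));     // status line, stored under a null key
        headers.put("content-encoding", List.of("GZIP"));  // unusual casing on purpose

        System.out.println(hasEncoding(headers, "gzip"));    // true
        System.out.println(hasEncoding(headers, "deflate")); // false
    }
}
```

Iterating the entries directly avoids a second map lookup per key, but the comparison logic is the same as in the method body shown above.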
-
usesDeflate
public static boolean usesDeflate (java.util.Map<java.lang.String,java.util.List<java.lang.String>> httpHeaders)
This method will check whether the HTTP Header returned by a website indicates that the content has been encoded using the ZIP Compression (PKZIP, Deflate) encoding. It expects the java.util.Map that is returned from an invocation of HttpURLConnection.getHeaderFields().
- Parameters:
httpHeaders - This is simply a java.util.Map<String, List<String>>. It must be the exact map that is returned by the HttpURLConnection.
- Returns:
- If this map contains a property named "Content-Encoding" AND this property has a property-value in its list equal to "deflate", then this method will return TRUE. Otherwise this method will return FALSE. Since HTTP Headers are considered CASE INSENSITIVE, all String comparisons done in this method shall ignore case.
- Code:
- Exact Method Body:
// NOTE: HTTP Headers are CASE-INSENSITIVE, so a loop is needed to check if
//       certain values are present - rather than the (more simple) Map.containsKey(...)

for (String prop : httpHeaders.keySet())

    // Check (Case Insensitive) if the HTTP Headers Map has the property "Content-Encoding"
    // NOTE: The returned Maps have been known to contain null keys, so check for that here

    if ((prop != null) && prop.equalsIgnoreCase("Content-Encoding"))

        // Check (Case Insensitive), if any properties assigned to "Content-Encoding" are
        // "DEFLATE" - then return TRUE immediately.

        for (String vals : httpHeaders.get(prop))
            if (vals.equalsIgnoreCase("deflate")) return true;

// The property-value "deflate" wasn't found, so return FALSE.
return false;
-
checkHTTPCompression
public static java.io.InputStream checkHTTPCompression (java.util.Map<java.lang.String,java.util.List<java.lang.String>> httpHeaders, java.io.InputStream is) throws java.io.IOException
This method will check whether the HTTP Header returned by a website indicates that the content has been encoded using compression. It expects the java.util.Map that is returned from an invocation of HttpURLConnection.getHeaderFields().
- Parameters:
httpHeaders - This is simply a java.util.Map<String, List<String>>. It must be the exact map that is returned by the HttpURLConnection.
is - This should be the InputStream that is returned from the HttpURLConnection when requesting the content from the web-server that is hosting the URL. The HTTP Headers will be searched, and if a compression algorithm has been specified (and the algorithm is one of the algorithms automatically handled by Java), then this InputStream shall be wrapped by the appropriate decompression algorithm.
- Returns:
- If this map contains a property named "Content-Encoding" AND this property has a property-value in its list equal to either "deflate" or "gzip", then this shall return a wrapped InputStream that is capable of handling the decompression algorithm. Otherwise, the original InputStream is returned unchanged. Since HTTP Headers are considered CASE INSENSITIVE, all String comparisons done in this method shall ignore case.
- Throws:
java.io.IOException
- Code:
- Exact Method Body:
// NOTE: HTTP Headers are CASE-INSENSITIVE, so a loop is needed to check if
//       certain values are present - rather than the (more simple) Map.containsKey(...)

for (String prop : httpHeaders.keySet())

    // Check (Case Insensitive) if the HTTP Headers Map has the property "Content-Encoding"
    // NOTE: The returned Maps have been known to contain null keys, so check for that here

    if ((prop != null) && prop.equalsIgnoreCase("Content-Encoding"))

        // Check (Case Insensitive), if any properties assigned to "Content-Encoding"
        // are "DEFLATE" or "GZIP" - then return the compression-algorithm immediately.

        for (String vals : httpHeaders.get(prop))

            if (vals.equalsIgnoreCase("gzip"))          return new GZIPInputStream(is);
            else if (vals.equalsIgnoreCase("deflate"))  return new ZipInputStream(is);

// Neither of the property-values "gzip" or "deflate" were found.
// Return the original input stream.
return is;
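One point worth illustrating: the method body above wraps "deflate" content in a ZipInputStream, which reads ZIP archives (files with entries), whereas the HTTP "deflate" content-coding is a zlib stream, which java.util.zip.InflaterInputStream reads. The sketch below is an alternative under that assumption, not this class's implementation; DecompressSketch, wrap, and roundTrip are hypothetical names, and roundTrip exists only to exercise wrap without a network connection.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.InflaterInputStream;

public class DecompressSketch {

    // Wraps the stream according to a single Content-Encoding value.
    // Unknown or null encodings leave the stream untouched.
    public static InputStream wrap(String contentEncoding, InputStream is) {
        try {
            if ("gzip".equalsIgnoreCase(contentEncoding))    return new GZIPInputStream(is);
            if ("deflate".equalsIgnoreCase(contentEncoding)) return new InflaterInputStream(is);
            return is;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Compresses the text in memory with the named encoding, then reads it
    // back through wrap(...) to show that the pairing is correct.
    public static String roundTrip(String text, String encoding) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();

            try (OutputStream out = "gzip".equalsIgnoreCase(encoding)
                    ? new GZIPOutputStream(buf)
                    : new DeflaterOutputStream(buf)) {
                out.write(text.getBytes(StandardCharsets.UTF_8));
            }

            try (InputStream in = wrap(encoding, new ByteArrayInputStream(buf.toByteArray()))) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("<html><body>gzip</body></html>", "gzip"));
        System.out.println(roundTrip("<html><body>deflate</body></html>", "deflate"));
    }
}
```

Some servers are reported to send raw deflate data without the zlib wrapper; handling that case would need an Inflater created with nowrap set to true, which this sketch does not attempt.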
-
httpHeadersToString
public static java.lang.String httpHeadersToString (java.util.Map<java.lang.String,java.util.List<java.lang.String>> httpHeaders)
This method shall simply take as input a java.util.Map which contains the HTTP Header properties that must have been generated by a call to the method HttpURLConnection.getHeaderFields(). It will produce a Java String that lists these headers in text / readable format.
- Parameters:
httpHeaders - This parameter must be an instance of java.util.Map<String, List<String>> and it should have been generated by a call to HttpURLConnection.getHeaderFields(). The property names and values contained by this Map will be iterated and printed to a returned java.lang.String.
- Returns:
- This shall return a printed version of the Map.
- Code:
- Exact Method Body:
StringBuilder sb = new StringBuilder();
int max = 0;

// To ensure that the output string is "aligned", check the length of each of the
// keys in the HTTP Header.

for (String key : httpHeaders.keySet()) if (key.length() > max) max = key.length();

max += 5;

// Iterate all of the Properties that are included in the 'httpHeaders' parameter
// It is important to note that the java "toString()" method for the List<String> that
// is used to store the Property-Values list works great, without any changes.

for (String key : httpHeaders.keySet())
    sb.append(
        StringParse.rightSpacePad(key + ':', max) +
        httpHeaders.get(key).toString() + '\n'
    );

return sb.toString();
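The comments in usesGZIP note that the header maps have been known to contain null keys, and the loop above calls key.length() on every key. The following sketch shows one way the same aligned listing could be produced while tolerating a null key. It is an illustration only: HeaderPrintSketch and format are hypothetical names, and String.format is used as a stand-in for StringParse.rightSpacePad, whose exact behavior is assumed here to be right-padding with spaces.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HeaderPrintSketch {

    // Produces one "name:   [values]" line per header, with names padded to a
    // common width. A null key is printed as the text "null".
    public static String format(Map<String, List<String>> headers) {
        int max = 0;

        // String.valueOf turns a null key into "null" instead of throwing.
        for (String key : headers.keySet())
            max = Math.max(max, String.valueOf(key).length());

        max += 5;

        StringBuilder sb = new StringBuilder();

        for (Map.Entry<String, List<String>> e : headers.entrySet()) {
            String label = String.valueOf(e.getKey()) + ':';

            // "%-Ns" left-justifies the label in a field of N characters.
            sb.append(String.format("%-" + max + "s", label))
              .append(e.getValue())
              .append('\n');
        }

        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> headers = new LinkedHashMap<>();
        headers.put(null, List.of("HTTP/1.1 200 OK"));
        headers.put("Content-Type", List.of("text/html; charset=UTF-8"));
        headers.put("Server", List.of("nginx"));

        System.out.print(format(headers));
    }
}
```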
-
openConn
public static java.io.BufferedReader openConn(java.lang.String url) throws java.io.IOException
- Code:
- Exact Method Body:
return openConn(new URL(url));
-
openConn
public static java.io.BufferedReader openConn(java.net.URL url) throws java.io.IOException
Opens a standard connection to a URL, and returns a BufferedReader for reading from it.
GZIP Compression:
It is uncommon to find compressed content on many of the major web-sites or web-hubs that are prevalent on the internet today. There are some where one may find a zipped page; for instance, Amazon occasionally returns zipped pages. This method will check whether one of the ZIP formats ('gzip' and 'deflate') has been specified in the HTTP Header that is returned. If so, the '.html' file received will be decompressed first.
It should be noted that 'gzip' and 'deflate' are not the only compression algorithms that may be specified in an HTTP Header; there are others (tar-gzip, pack200, zStandard, etc.). If another compression algorithm is in use by a web-server for a specific URL, then manually selecting a decompression algorithm would be necessary.
Employing a User-Agent:
The inclusion of the "User Agent" field in this URL Connection can be controlled from two public static fields at the top of this class. Being able to identify how a web-server will respond to a different "Browser User Agent" is well beyond the scope of these documentation notes, and is the subject of the "browser war." Using them is not mandatory, and "which browser is being used" is all that the 'USER_AGENT' field of an 'HttpURLConnection' even signifies.
- Parameters:
url - This may be an Internet URL.
- Returns:
- A Java BufferedReader for retrieving the data from the internet connection.
- Throws:
java.io.IOException
- See Also:
USER_AGENT, USE_USER_AGENT, checkHTTPCompression(Map, InputStream)
- Code:
- Exact Method Body:
HttpURLConnection con = (HttpURLConnection) url.openConnection();

con.setRequestMethod("GET");

if (USE_USER_AGENT) con.setRequestProperty("User-Agent", USER_AGENT);

InputStream is = checkHTTPCompression(con.getHeaderFields(), con.getInputStream());

return new BufferedReader(new InputStreamReader(is));
-
openConnGetHeader
public static Ret2<java.io.BufferedReader,java.util.Map<java.lang.String,java.util.List<java.lang.String>>> openConnGetHeader (java.net.URL url) throws java.io.IOException
Opens a standard connection to a URL, and returns a BufferedReader for reading it, and also the HTTP Header that was returned by the HTTP Server.
GZIP Compression:
It is uncommon to find compressed content on many of the major web-sites or web-hubs that are prevalent on the internet today. There are some where one may find a zipped page; for instance, Amazon occasionally returns zipped pages. This method will check whether one of the ZIP formats ('gzip' and 'deflate') has been specified in the HTTP Header that is returned. If so, the '.html' file received will be decompressed first.
It should be noted that 'gzip' and 'deflate' are not the only compression algorithms that may be specified in an HTTP Header; there are others (tar-gzip, pack200, zStandard, etc.). If another compression algorithm is in use by a web-server for a specific URL, then manually selecting a decompression algorithm would be necessary.
- Parameters:
url - This may be an Internet URL.
- Returns:
- This shall return an instance of class Ret2. The contents of the multiple return type are as follows:
Ret2.a (BufferedReader)
A BufferedReader that shall retrieve the HTTP Response from the URL provided to this method.
Ret2.b (java.util.Map)
An instance of Map<String, List<String>> which will contain the HTTP Headers which are returned by the HTTP Server associated with the URL provided to this method. This HTTP Header is obtained from the Java method HttpURLConnection.getHeaderFields()
- Throws:
java.io.IOException
- See Also:
checkHTTPCompression(Map, InputStream)
- Code:
- Exact Method Body:
HttpURLConnection con = (HttpURLConnection) url.openConnection();

con.setRequestMethod("GET");

if (USE_USER_AGENT) con.setRequestProperty("User-Agent", USER_AGENT);

Map<String, List<String>> httpHeaders = con.getHeaderFields();

InputStream is = checkHTTPCompression(httpHeaders, con.getInputStream());

return new Ret2<BufferedReader, Map<String, List<String>>>
    (new BufferedReader(new InputStreamReader(is)), httpHeaders);
-
openConn_iso_8859_1
public static java.io.BufferedReader openConn_iso_8859_1 (java.lang.String url) throws java.io.IOException
- Code:
- Exact Method Body:
return openConn_iso_8859_1(new URL(url));
-
openConn_iso_8859_1
public static java.io.BufferedReader openConn_iso_8859_1 (java.net.URL url) throws java.io.IOException
Opens an ISO-8859-1 connection to a URL, and returns a BufferedReader for reading it.
GZIP Compression:
It is uncommon to find compressed content on many of the major web-sites or web-hubs that are prevalent on the internet today. There are some where one may find a zipped page; for instance, Amazon occasionally returns zipped pages. This method will check whether one of the ZIP formats ('gzip' and 'deflate') has been specified in the HTTP Header that is returned. If so, the '.html' file received will be decompressed first.
It should be noted that 'gzip' and 'deflate' are not the only compression algorithms that may be specified in an HTTP Header; there are others (tar-gzip, pack200, zStandard, etc.). If another compression algorithm is in use by a web-server for a specific URL, then manually selecting a decompression algorithm would be necessary.
Employing a User-Agent:
The inclusion of the "User Agent" field in this URL Connection can be controlled from two public static fields at the top of this class. Being able to identify how a web-server will respond to a different "Browser User Agent" is well beyond the scope of these documentation notes, and is the subject of the "browser war." Using them is not mandatory, and "which browser is being used" is all that the 'USER_AGENT' field of an 'HttpURLConnection' even signifies.
- Parameters:
url - This may be an Internet URL. The site and page to which it points should return data encoded in the ISO-8859-1 charset.
- Returns:
- A Java BufferedReader for retrieving the data from the internet connection.
- Throws:
java.io.IOException
- See Also:
USER_AGENT, USE_USER_AGENT, checkHTTPCompression(Map, InputStream)
- Code:
- Exact Method Body:
HttpURLConnection con = (HttpURLConnection) url.openConnection();

con.setRequestMethod("GET");

if (USE_USER_AGENT) con.setRequestProperty("User-Agent", USER_AGENT);

con.setRequestProperty("Content-Type", "text/html; charset=iso-8859-1");

InputStream is = checkHTTPCompression(con.getHeaderFields(), con.getInputStream());

return new BufferedReader(new InputStreamReader(is, Charset.forName("iso-8859-1")));
-
openConnGetHeader_iso_8859_1
public static Ret2<java.io.BufferedReader,java.util.Map<java.lang.String,java.util.List<java.lang.String>>> openConnGetHeader_iso_8859_1 (java.net.URL url) throws java.io.IOException
Opens an ISO-8859-1 connection to a URL, and returns a BufferedReader for reading it, and also the HTTP Header that was returned by the HTTP Server.
GZIP Compression:
It is uncommon to find compressed content on many of the major web-sites or web-hubs that are prevalent on the internet today. There are some where one may find a zipped page; for instance, Amazon occasionally returns zipped pages. This method will check whether one of the ZIP formats ('gzip' and 'deflate') has been specified in the HTTP Header that is returned. If so, the '.html' file received will be decompressed first.
It should be noted that 'gzip' and 'deflate' are not the only compression algorithms that may be specified in an HTTP Header; there are others (tar-gzip, pack200, zStandard, etc.). If another compression algorithm is in use by a web-server for a specific URL, then manually selecting a decompression algorithm would be necessary.
- Parameters:
url - This may be an Internet URL. The site and page to which it points should return data encoded in the ISO-8859-1 charset.
- Returns:
- This shall return an instance of class Ret2. The contents of the multiple return type are as follows:
Ret2.a (BufferedReader)
A BufferedReader that shall retrieve the HTTP Response from the URL provided to this method.
Ret2.b (java.util.Map)
An instance of Map<String, List<String>> which will contain the HTTP Headers which are returned by the HTTP Server associated with the URL provided to this method. This HTTP Header is obtained from the Java method HttpURLConnection.getHeaderFields()
- Throws:
java.io.IOException
- See Also:
checkHTTPCompression(Map, InputStream)
- Code:
- Exact Method Body:
HttpURLConnection con = (HttpURLConnection) url.openConnection();

con.setRequestMethod("GET");

if (USE_USER_AGENT) con.setRequestProperty("User-Agent", USER_AGENT);

con.setRequestProperty("Content-Type", "charset=iso-8859-1");

Map<String, List<String>> httpHeaders = con.getHeaderFields();

InputStream is = checkHTTPCompression(httpHeaders, con.getInputStream());

// Charset.forName expects a bare charset name; a "charset=" prefix is not a legal name
return new Ret2<BufferedReader, Map<String, List<String>>>(
    new BufferedReader(new InputStreamReader(is, Charset.forName("iso-8859-1"))),
    httpHeaders
);
-
openConn_UTF8
public static java.io.BufferedReader openConn_UTF8(java.lang.String url) throws java.io.IOException
- Code:
- Exact Method Body:
return openConn_UTF8(new URL(url));
-
openConn_UTF8
public static java.io.BufferedReader openConn_UTF8(java.net.URL url) throws java.io.IOException
Opens a UTF-8 connection to a URL, and returns a BufferedReader for reading it.
UTF-8 Character-Encoding:
For all intents and purposes, Java's internal class HttpURLConnection will handle any received UTF-8 content automatically. What this, sort of, means is that this method you are looking at right now is largely "unnecessary". It probably should be placed on the @Deprecated list; however, just in case a bizarre or unforeseen situation arises where this method could be used as a reference, it shall remain here.
Please note that any attempt to connect to, or retrieve, '.html' content from a web-server that is returning charset=UTF-8 is handled by the JRE with ease, since the Java primitive-type char is a 16-bit type. Instead, the methods openConn(String) and openConn(URL), etc. (without the "UTF8" appended to the method name) should suffice for making such connections. It should make no difference.
GZIP Compression:
It is uncommon to find compressed content on many of the major web-sites or web-hubs that are prevalent on the internet today. There are some where one may find a zipped page; for instance, Amazon occasionally returns zipped pages. This method will check whether one of the ZIP formats ('gzip' and 'deflate') has been specified in the HTTP Header that is returned. If so, the '.html' file received will be decompressed first.
It should be noted that 'gzip' and 'deflate' are not the only compression algorithms that may be specified in an HTTP Header; there are others (tar-gzip, pack200, zStandard, etc.). If another compression algorithm is in use by a web-server for a specific URL, then manually selecting a decompression algorithm would be necessary.
Employing a User-Agent:
The inclusion of the "User Agent" field in this URL Connection can be controlled from two public static fields at the top of this class. Being able to identify how a web-server will respond to a different "Browser User Agent" is well beyond the scope of these documentation notes, and is the subject of the "browser war." Using them is not mandatory, and "which browser is being used" is all that the 'USER_AGENT' field of an 'HttpURLConnection' even signifies.
- Parameters:
url - This may be an Internet URL. The site and page to which it points should return data encoded in the UTF-8 charset.
- Returns:
- A Java BufferedReader for retrieving the data from the internet connection.
- Throws:
java.io.IOException
- See Also:
USER_AGENT, USE_USER_AGENT, checkHTTPCompression(Map, InputStream)
- Code:
- Exact Method Body:
HttpURLConnection con = (HttpURLConnection) url.openConnection();

con.setRequestMethod("GET");

if (USE_USER_AGENT) con.setRequestProperty("User-Agent", USER_AGENT);

con.setRequestProperty("Content-Type", "charset=UTF-8");

InputStream is = checkHTTPCompression(con.getHeaderFields(), con.getInputStream());

return new BufferedReader(new InputStreamReader(is, Charset.forName("UTF-8")));
-
openConnGetHeader_UTF8
public static Ret2<java.io.BufferedReader,java.util.Map<java.lang.String,java.util.List<java.lang.String>>> openConnGetHeader_UTF8 (java.net.URL url) throws java.io.IOException
Opens a UTF-8 connection to a URL, and returns a BufferedReader for reading it, and also the HTTP Header that was returned by the HTTP Server.
UTF-8 Character-Encoding:
For all intents and purposes, Java's internal class HttpURLConnection will handle any received UTF-8 content automatically. What this, sort of, means is that this method you are looking at right now is largely "unnecessary". It probably should be placed on the @Deprecated list; however, just in case a bizarre or unforeseen situation arises where this method could be used as a reference, it shall remain here.
Please note that any attempt to connect to, or retrieve, '.html' content from a web-server that is returning charset=UTF-8 is handled by the JRE with ease, since the Java primitive-type char is a 16-bit type. Instead, the methods openConn(String) and openConn(URL), etc. (without the "UTF8" appended to the method name) should suffice for making such connections. It should make no difference.
GZIP Compression:
It is uncommon to find compressed content on many of the major web-sites or web-hubs that are prevalent on the internet today. There are some where one may find a zipped page; for instance, Amazon occasionally returns zipped pages. This method will check whether one of the ZIP formats ('gzip' and 'deflate') has been specified in the HTTP Header that is returned. If so, the '.html' file received will be decompressed first.
It should be noted that 'gzip' and 'deflate' are not the only compression algorithms that may be specified in an HTTP Header; there are others (tar-gzip, pack200, zStandard, etc.). If another compression algorithm is in use by a web-server for a specific URL, then manually selecting a decompression algorithm would be necessary.
- Parameters:
url - This may be an Internet URL. The site and page to which it points should return data encoded in the UTF-8 charset.
- Returns:
- This shall return an instance of class Ret2. The contents of the multiple return type are as follows:
Ret2.a (BufferedReader)
A BufferedReader that shall retrieve the HTTP Response from the URL provided to this method.
Ret2.b (java.util.Map)
An instance of Map<String, List<String>> which will contain the HTTP Headers which are returned by the HTTP Server associated with the URL provided to this method. This HTTP Header is obtained from the Java method HttpURLConnection.getHeaderFields()
- Throws:
java.io.IOException
- See Also:
checkHTTPCompression(Map, InputStream)
- Code:
- Exact Method Body:
HttpURLConnection con = (HttpURLConnection) url.openConnection();

con.setRequestMethod("GET");

if (USE_USER_AGENT) con.setRequestProperty("User-Agent", USER_AGENT);

con.setRequestProperty("Content-Type", "charset=UTF-8");

Map<String, List<String>> httpHeaders = con.getHeaderFields();

InputStream is = checkHTTPCompression(httpHeaders, con.getInputStream());

return new Ret2<BufferedReader, Map<String, List<String>>>(
    new BufferedReader(new InputStreamReader(is, Charset.forName("UTF-8"))),
    httpHeaders
);
-
scrapePage
public static java.lang.String scrapePage(java.lang.String url) throws java.io.IOException
Convenience Method
Invokes: scrapePage(BufferedReader)
Obtains: BufferedReader from openConn(String)
- Code:
- Exact Method Body:
return scrapePage(openConn(url));
-
scrapePage
public static java.lang.String scrapePage(java.net.URL url) throws java.io.IOException
- Code:
- Exact Method Body:
return scrapePage(openConn(url));
-
scrapePage
public static java.lang.String scrapePage(java.io.BufferedReader br) throws java.io.IOException
This scrapes a website and dumps the entire contents into a java.lang.String.
- Parameters:
br - This is a Reader that needs to have been connected to a Website that will output text/html data.
- Returns:
- The text/html data, returned inside a String
- Throws:
java.io.IOException
- Code:
- Exact Method Body:
StringBuffer sb = new StringBuffer();
String s;

while ((s = br.readLine()) != null) sb.append(s + "\n");

return sb.toString();
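Since the method only needs a BufferedReader, the read loop can be tried without any network access by wrapping a StringReader. The sketch below re-creates the loop in a stand-alone class for that purpose; ScrapePageSketch and readAll are hypothetical names, and the sample HTML is made up. It also shows a side effect of BufferedReader.readLine(): every line terminator ("\n", "\r", or "\r\n") comes back as a single '\n', and the last line always ends with one.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ScrapePageSketch {

    // Reads every line from the reader and joins them with '\n'.
    public static String readAll(BufferedReader br) {
        try {
            StringBuilder sb = new StringBuilder();
            String s;

            // readLine() strips the line terminator, so it is re-added here.
            while ((s = br.readLine()) != null) sb.append(s).append('\n');

            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Windows-style line ending between the lines, none after the last one.
        String page = "<p>first</p>\r\n<p>second</p>";

        String result = readAll(new BufferedReader(new StringReader(page)));

        System.out.print(result);
        System.out.println("length = " + result.length());
    }
}
```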
-
scrapePageToVector
public static java.util.Vector<java.lang.String> scrapePageToVector (java.lang.String url, boolean includeNewLine) throws java.io.IOException
Convenience Method
Invokes: scrapePageToVector(BufferedReader, boolean)
Obtains: BufferedReader from openConn(String)
- Code:
- Exact Method Body:
return scrapePageToVector(openConn(url), includeNewLine);
-
scrapePageToVector
public static java.util.Vector<java.lang.String> scrapePageToVector (java.net.URL url, boolean includeNewLine) throws java.io.IOException
Convenience Method
Invokes: scrapePageToVector(BufferedReader, boolean)
Obtains: BufferedReader from openConn(URL)
- Code:
- Exact Method Body:
return scrapePageToVector(openConn(url), includeNewLine);
-
scrapePageToVector
public static java.util.Vector<java.lang.String> scrapePageToVector (java.io.BufferedReader br, boolean includeNewLine) throws java.io.IOException
This will scrape the entire contents of an HTML page to a Vector<String>. Each line of the text/HTML page is demarcated by the reception of a '\n' character from the web-server.
- Parameters:
br - This is the input source of the HTML page. It will query for String data.
includeNewLine - If TRUE, the '\n' character is appended to the end of each String in the Vector.
- Returns:
- a Vector of Strings where each String is a line on the web-page.
- Throws:
java.io.IOException
- See Also:
scrapePageToVector(String, boolean)
- Code:
- Exact Method Body:
Vector<String> ret = new Vector<>();
String s = null;

if (includeNewLine)
    while ((s = br.readLine()) != null) ret.add(s + '\n');
else
    while ((s = br.readLine()) != null) ret.add(s);

return ret;
-
getHTML
public static java.lang.StringBuffer getHTML(java.io.BufferedReader br, java.lang.String startTag, java.lang.String endTag) throws java.io.IOException
This receives an input stream that contains a pipe to a website that will produce HTML. The HTML is read from the website, and returned as a StringBuffer. This is called "scraping HTML."
- Parameters:
startTag - If this is null, the scrape will begin with the first character received. If this contains a String, the scrape will not include any text/HTML data that occurs prior to the first occurrence of 'startTag'.
endTag - If this is null, the scrape will read the entire contents of text/HTML data from the BufferedReader br parameter. If this contains a String, then data will be read and included in the result until 'endTag' is received.
- Returns:
- a StringBuffer that is text/html data retrieved from the Reader. Call toString() on the return value to retrieve that String.
- Throws:
ScrapeException - If, after download completes, either the 'startTag' or the parameter 'endTag' does not represent a String that was found within the downloaded page, this exception is thrown.
java.io.IOException
- Code:
- Exact Method Body:
StringBuffer html = new StringBuffer();
String s;

// Nice Long Name... Guess what it means
boolean alreadyFoundEndTagInStartTagLine = false;

// If the startTag parameter is not null, skip all content, until the startTag is found!
if (startTag != null)
{
    boolean foundStartTag = false;

    while ((s = br.readLine()) != null)

        if (s.contains(startTag))
        {
            int startTagPos = s.indexOf(startTag);

            foundStartTag = true;

            // NOTE: Sometimes the 'startTag' and 'endTag' are on the same line!
            //       This happens, for instance, on Yahoo Photos, when giant lines
            //       (no line-breaks) are transmitted
            //       Hence... *really* long variable name, this is confusing!

            s = s.substring(startTagPos);

            if ((endTag != null) && s.contains(endTag))
            {
                s = s.substring(0, s.indexOf(endTag) + endTag.length());
                alreadyFoundEndTagInStartTagLine = true;
            }

            html.append(s + "\n");
            break;
        }

    if (! foundStartTag) throw new ScrapeException
        ("Start Tag: '" + startTag + "' was Not Found on Page.");
}

// if the endTag parameter is not null, stop reading as soon as the end-tag is found
if (endTag != null)
{
    // NOTE: This 'if' is inside curly-braces, because there is an 'else' that "goes with"
    //       the 'if' above... BUT NOT the following 'if'

    if (! alreadyFoundEndTagInStartTagLine)
    {
        boolean foundEndTag = false;

        while ((s = br.readLine()) != null)

            if (s.contains(endTag))
            {
                foundEndTag = true;
                int endTagPos = s.indexOf(endTag);
                html.append(s.substring(0, endTagPos + endTag.length()) + "\n");
                break;
            }

            else html.append(s + "\n");

        if (! foundEndTag) throw new ScrapeException
            ("End Tag: '" + endTag + "' was Not Found on Page.");
    }
}

// ELSE: (endTag *was* null) ... read all content until EOF ... or ... "EOWP" (end of web-page)
else
    while ((s = br.readLine()) != null) html.append(s + "\n");

// Kind of an annoying line, but this is the new "Multi-Threaded" thing I added.
return html;
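The start-tag / end-tag rules are easier to follow on a page that is already in memory. The sketch below is a simplification under that assumption, not the method above: it works on a whole String instead of reading line by line, it does not append '\n' after each line, and it throws IllegalArgumentException where the real method throws ScrapeException. TagRangeSketch and extract are hypothetical names.

```java
public class TagRangeSketch {

    // Returns the text from the first occurrence of startTag through the first
    // occurrence of endTag that follows it, both inclusive. A null tag means
    // "no limit" on that side of the range.
    public static String extract(String page, String startTag, String endTag) {
        int start = 0;

        if (startTag != null) {
            start = page.indexOf(startTag);
            if (start < 0) throw new IllegalArgumentException(
                "Start Tag: '" + startTag + "' was Not Found on Page.");
        }

        int end = page.length();

        if (endTag != null) {
            // Search only from the start position onward, so an endTag that
            // appears before the startTag is ignored.
            int pos = page.indexOf(endTag, start);
            if (pos < 0) throw new IllegalArgumentException(
                "End Tag: '" + endTag + "' was Not Found on Page.");
            end = pos + endTag.length();
        }

        return page.substring(start, end);
    }

    public static void main(String[] args) {
        String page = "<html><div id=x>hi</div></html>";

        System.out.println(extract(page, "<div", "</div>"));
        System.out.println(extract(page, null, null));
    }
}
```

As in the method above, the tags are plain substrings rather than parsed HTML elements, so the first textual match wins even if it sits inside a comment or an attribute value.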
-
getHTML
public static java.lang.StringBuffer getHTML(java.io.BufferedReader br, int startLineNum, int endLineNum) throws java.io.IOException
This receives a BufferedReader that is connected to a website that will produce HTML. The HTML is read from the website, and returned as a StringBuffer. This is called "scraping HTML."
- Parameters:
startLineNum - If this is '0' or '1', the scrape will begin with the first character received. If this contains a positive integer, the scrape will not include any text/HTML data that occurs prior to int startLineNum lines of text/html having been received.
endLineNum - If this is negative, the scrape will read the entire contents of text/HTML data from the BufferedReader br parameter (until EOF is encountered). If this contains a positive integer, then data will be read and included in the result until int endLineNum lines of text/html have been received.
- Returns:
- a StringBuffer that is text/html data retrieved from the Reader. Call toString() on the return value to retrieve that String.
- Throws:
java.lang.IllegalArgumentException - If parameter 'startLineNum' is negative or greater than 'endLineNum'. If 'endLineNum' was negative, this comparison is skipped. Per the method body, this exception is also thrown when 'endLineNum' is zero.
ScrapeException - If there were not enough lines read from the BufferedReader parameter to be consistent with the values in 'startLineNum' and 'endLineNum'.
java.io.IOException
- Code:
- Exact Method Body:
StringBuffer html = new StringBuffer();
String s = "";

// NOTE: Arrays start at 0, **BUT** HTML page line counts start at 1!
int curLineNum = 1;

if (startLineNum < 0) throw new IllegalArgumentException(
    "The parameter startLineNum is negative: " + startLineNum + " but this is not " +
    "allowed."
);

if (endLineNum == 0) throw new IllegalArgumentException
    ("The parameter endLineNum is zero, but this is not allowed.");

endLineNum = (endLineNum < 0) ? 1 : endLineNum;
startLineNum = (startLineNum == 0) ? 1 : startLineNum;

if ((endLineNum < startLineNum) && (endLineNum != 1))

    throw new IllegalArgumentException(
        "The parameter startLineNum is: " + startLineNum + "\n" +
        "The parameter endLineNum is: " + endLineNum + "\n" +
        "It is required that the latter is larger than the former, " +
        "or it must be 0 or negative to signify read until EOF."
    );

if (startLineNum > 1)
{
    while (curLineNum++ < startLineNum)

        if (br.readLine() == null) throw new ScrapeException(
            "The HTML Page that was given didn't even have enough lines to read " +
            "quantity in variable startLineNum.\nstartLineNum = " + startLineNum +
            " and read " + (curLineNum-1) + " line(s) before EOF."
        );

    // Off-By-One computer science error correction - remember post-increment, means the
    // last loop iteration didn't read a line, but did increment the loop counter!
    curLineNum--;
}

// endLineNum==1 means/implies that we don't have to heed the
// endLineNum variable ==> read to EOF/null!
if (endLineNum == 1)

    while ((s = br.readLine()) != null) html.append(s + "\n");

// endLineNum > 1 ==> Heed endLineNum variable!
else
{
    // System.out.println("At START of LOOP: curLineNum = " + curLineNum +
    //      " and endLineNum = " + endLineNum);

    for ( ;curLineNum <= endLineNum; curLineNum++)

        if ((s = br.readLine()) != null) html.append(s + "\n");
        else break;

    // NOTE: curLineNum-1 and endLineNum+1 are used because:
    //
    // ** The loop counter (curLineNum) breaks when the next line to read is the one
    //    past the endLineNum
    // ** endLineNum+1 is the appropriate state if enough lines were read from the
    //    HTML Page
    // ** curLineNum-1 is the number of the last line read from the HTML
    if (curLineNum != (endLineNum+1)) throw new ScrapeException(
        "The HTML Page that was read didn't have enough lines to read to quantity in " +
        "variable endLineNum.\nendLineNum = " + endLineNum + " but only read " +
        (curLineNum-1) + " line(s) before EOF."
    );
}

// Kind of an annoying line, but this is the new "Multi-Threaded" thing I added.
return html;
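The 1-based line-range idea can be exercised against an in-memory reader. The following is a minimal sketch under stated assumptions, not the library's implementation: LineRangeSketch and readLines are hypothetical names, IllegalStateException stands in for ScrapeException, and IOException is wrapped in UncheckedIOException. One deliberate difference: in the method body above, negative values of endLineNum are mapped to 1, so an endLineNum of exactly 1 also falls into the read-to-EOF branch. This sketch treats only negative values as "read to EOF" and returns a single line for an endLineNum of 1.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical illustration class; it is NOT part of Torello.HTML.
final class LineRangeSketch
{
    // Returns lines startLineNum..endLineNum (1-based, inclusive), each followed by '\n'.
    // A negative endLineNum means "read until EOF".
    static String readLines(BufferedReader br, int startLineNum, int endLineNum)
    {
        if (startLineNum < 1) throw new IllegalArgumentException
            ("startLineNum must be 1 or greater: " + startLineNum);

        if ((endLineNum >= 0) && (endLineNum < startLineNum))
            throw new IllegalArgumentException
                ("endLineNum must be negative, or at least startLineNum: " + endLineNum);

        StringBuilder sb = new StringBuilder();

        try
        {
            String line;
            int lineNum = 0;    // Number of the line most recently read

            while ((line = br.readLine()) != null)
            {
                lineNum++;

                // Skip everything before the requested start line
                if (lineNum < startLineNum) continue;

                sb.append(line).append('\n');

                // Stop as soon as the requested end line has been appended
                if ((endLineNum >= 0) && (lineNum == endLineNum)) return sb.toString();
            }

            // EOF was reached.  Stand-ins for the ScrapeException cases of the real method.
            if (lineNum < startLineNum - 1) throw new IllegalStateException
                ("Only " + lineNum + " line(s) were available before EOF.");

            if (endLineNum >= 0) throw new IllegalStateException
                ("Input ended at line " + lineNum + ", before endLineNum " + endLineNum);

            return sb.toString();
        }
        catch (IOException e) { throw new UncheckedIOException(e); }
    }

    public static void main(String[] args)
    {
        String text = "one\ntwo\nthree\nfour\n";

        // Lines 2 through 3 of the four-line input
        System.out.print(readLines(new BufferedReader(new StringReader(text)), 2, 3));
    }
}
```

Counting the line just read, rather than the line about to be read, avoids the off-by-one correction that the original loop needs after its post-increment.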
-
-