Package Torello.HTML
Class Scrape
- java.lang.Object
  - Torello.HTML.Scrape
public class Scrape extends java.lang.Object
Some standard utilities for transferring & downloading HTML from web-sites, and then storing that content in memory as a Java String - which, subsequently, can be written to disk, transferred elsewhere, or even parsed (using class HTMLPage). This class just simplifies some of the typing for common Java Network Connection / HTTP Connection code.
The openConn(args) methods open different types of connections to web-servers.
Connection Types:
It is important to note the major differences between web-connections. If a user is receiving simple ASCII, these connections will leave out 100% of the "higher order" UTF-8 characters (many of which are foreign-language characters). Often-times a usual Java web-connection method will suffice, but not always. If a website does not use any characters that range above ASCII 255, then the usual BufferedReader connection is just fine. UTF-8, however, is a very commonly used internet character-set. It includes everything from Spanish accent characters to many Chinese Mandarin characters, and tens of thousands of other characters, all of which are listed in the UTF-8 specification.
The iso_8859_1 version I was forced to use once for a site from Spain involving the famous book by Cervantes, although I'm not completely certain how this standard works - and I have only needed this connection type twice. UTF-8, on the other hand, is used on 70% of the websites that I have parsed.
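Example (Usage Sketch):
A minimal sketch of the intended workflow, not part of the original Javadoc. The URL is a placeholder chosen only for illustration; the method invoked is documented further down this page.

import Torello.HTML.Scrape;

public class ScrapeExample
{
    public static void main(String[] args) throws java.io.IOException
    {
        // Placeholder URL, for illustration only
        String html = Scrape.scrapePage("https://example.com/");

        // The returned String may now be written to disk,
        // transferred elsewhere, or parsed.
        System.out.println("Downloaded " + html.length() + " characters of HTML.");
    }
}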
Hi-Lited Source-Code:
- View Here: Torello/HTML/Scrape.java
File Size: 32,962 Bytes, Line Count: 775 '\n' Characters Found
Stateless Class: This class neither contains any program-state, nor can it be instantiated. The @StaticFunctional Annotation may also be called 'The Spaghetti Report'. Static-Functional classes are, essentially, C-Styled Files, without any constructors or non-static member fields. It is a concept very similar to the Java-Bean's @Stateless Annotation.
- 1 Constructor(s), 1 declared private, zero-argument constructor
- 21 Method(s), 21 declared static
- 2 Field(s), 2 declared static, 0 declared final
- Fields excused from final modifier (with explanation):
  Field 'USER_AGENT' is not final. Reason: CONFIGURATION
  Field 'USE_USER_AGENT' is not final. Reason: FLAG
-
Field Summary
- static boolean USE_USER_AGENT
- static String USER_AGENT
-
Method Summary

Open HTTP Connection, Get Reader:
- static BufferedReader openConn(String url)
- static BufferedReader openConn(URL url)
- static BufferedReader openConn_iso_8859_1(String url)
- static BufferedReader openConn_iso_8859_1(URL url)
- static BufferedReader openConn_UTF8(String url)
- static BufferedReader openConn_UTF8(URL url)

Open HTTP Connection, Get Reader & Headers:
- static Ret2<BufferedReader, Map<String,List<String>>> openConnGetHeader(URL url)
- static Ret2<BufferedReader, Map<String,List<String>>> openConnGetHeader_iso_8859_1(URL url)
- static Ret2<BufferedReader, Map<String,List<String>>> openConnGetHeader_UTF8(URL url)

Read / Scrape Contents to String:
- static String scrapePage(BufferedReader br)
- static String scrapePage(String url)
- static String scrapePage(URL url)

Read / Scrape Contents to Vector<String>:
- static Vector<String> scrapePageToVector(BufferedReader br, boolean includeNewLine)
- static Vector<String> scrapePageToVector(String url, boolean includeNewLine)
- static Vector<String> scrapePageToVector(URL url, boolean includeNewLine)

Read / Scrape Contents to StringBuffer, Range-Limited:
- static StringBuffer getHTML(BufferedReader br, int startLineNum, int endLineNum)
- static StringBuffer getHTML(BufferedReader br, String startTag, String endTag)

HTTP Header Methods:
- static InputStream checkHTTPCompression(Map<String,List<String>> httpHeaders, InputStream is)
- static String httpHeadersToString(Map<String,List<String>> httpHeaders)
- static boolean usesDeflate(Map<String,List<String>> httpHeaders)
- static boolean usesGZIP(Map<String,List<String>> httpHeaders)
-
Field Detail
-
USER_AGENT
public static java.lang.String USER_AGENT
When opening an HTTP URL connection, it is usually a good idea to use a "User Agent". The default behavior in this Scrape & Search Package is to connect using the field: public static String USER_AGENT = "Chrome/61.0.3163.100";
NOTE: This behavior may be changed by modifying these public static variables.
ALSO: If the boolean USE_USER_AGENT is set to FALSE, then no User-Agent will be used at all.
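Example (Configuration Sketch):
A short sketch of how these two public static fields might be adjusted before opening any connections. The user-agent value shown is only an illustrative stand-in, not a recommendation.

import Torello.HTML.Scrape;

public class UserAgentConfig
{
    public static void main(String[] args)
    {
        // Swap in a different browser-identification string
        // (illustrative value only)
        Scrape.USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64)";

        // Or disable the "User-Agent" request-property entirely
        Scrape.USE_USER_AGENT = false;
    }
}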
-
USE_USER_AGENT
public static boolean USE_USER_AGENT
When opening an HTTP URL connection, it is usually a good idea to use a "User Agent". The default behavior in this Scrape & Search Package is to connect using the field: public static String USER_AGENT = "Chrome/61.0.3163.100";
NOTE: This behavior may be changed by modifying these public static variables.
ALSO: If this boolean is set to FALSE, then no User-Agent will be used at all.
-
Method Detail
-
usesGZIP
public static boolean usesGZIP (java.util.Map<java.lang.String,java.util.List<java.lang.String>> httpHeaders)
This method will check whether the HTTP Header returned by a website has been encoded using the GZIP Compression encoding. It expects the java.util.Map that is returned from an invocation of HttpURLConnection.getHeaderFields().
Case-Insensitive:
Since HTTP Headers are considered CASE INSENSITIVE, all String comparisons done in this method shall ignore case.
- Parameters:
httpHeaders - This is simply a java.util.Map<String, List<String>>. It must be the exact map that is returned by the HttpURLConnection.
- Returns: If this map contains a property named "Content-Encoding" AND this property has a property-value in its list equal to "gzip", then this method will return TRUE. Otherwise this method will return FALSE.
- Code:
- Exact Method Body:
// NOTE: HTTP Headers are CASE-INSENSITIVE, so a loop is needed to check if
// certain values are present - rather than the (more simple) Map.containsKey(...)
for (String prop : httpHeaders.keySet())

    // Check (Case Insensitive) if the HTTP Headers Map has the property "Content-Encoding"
    // NOTE: The Maps returned have been known to contain null keys, so check for that here.
    if ((prop != null) && prop.equalsIgnoreCase("Content-Encoding"))

        // Check (Case Insensitive) if any of the properties assigned to "Content-Encoding"
        // is "GZIP".  If this is found, return TRUE immediately.
        for (String vals : httpHeaders.get(prop))
            if (vals.equalsIgnoreCase("gzip")) return true;

// The property-value "GZIP" wasn't found, so return FALSE.
return false;
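Example (Usage Sketch):
A sketch that feeds a live header-Map to both usesGZIP and usesDeflate (documented below). The URL is a placeholder; the connection calls are standard java.net API.

import Torello.HTML.Scrape;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
import java.util.Map;

public class EncodingCheck
{
    public static void main(String[] args) throws java.io.IOException
    {
        // Placeholder URL, for illustration only
        URL url = new URL("https://example.com/");

        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("GET");

        // The exact Map required by usesGZIP / usesDeflate
        Map<String, List<String>> headers = con.getHeaderFields();

        System.out.println("gzip:    " + Scrape.usesGZIP(headers));
        System.out.println("deflate: " + Scrape.usesDeflate(headers));
    }
}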
-
usesDeflate
public static boolean usesDeflate (java.util.Map<java.lang.String,java.util.List<java.lang.String>> httpHeaders)
This method will check whether the HTTP Header returned by a website has been encoded using the ZIP Compression (PKZIP, Deflate) encoding. It expects the java.util.Map that is returned from an invocation of HttpURLConnection.getHeaderFields().
- Parameters:
httpHeaders - This is simply a java.util.Map<String, List<String>>. It must be the exact map that is returned by the HttpURLConnection.
- Returns: If this map contains a property named "Content-Encoding" AND this property has a property-value in its list equal to "deflate", then this method will return TRUE. Otherwise this method will return FALSE.
NOTE: Since HTTP Headers are considered CASE INSENSITIVE, all String comparisons done in this method shall ignore case.
- Code:
- Exact Method Body:
// NOTE: HTTP Headers are CASE-INSENSITIVE, so a loop is needed to check if
// certain values are present - rather than the (more simple) Map.containsKey(...)
for (String prop : httpHeaders.keySet())

    // Check (Case Insensitive) if the HTTP Headers Map has the property "Content-Encoding"
    // NOTE: The returned Maps have been known to contain null keys, so check for that here
    if ((prop != null) && prop.equalsIgnoreCase("Content-Encoding"))

        // Check (Case Insensitive) if any properties assigned to "Content-Encoding" are
        // "DEFLATE" - then return TRUE immediately.
        for (String vals : httpHeaders.get(prop))
            if (vals.equalsIgnoreCase("deflate")) return true;

// The property-value "deflate" wasn't found, so return FALSE.
return false;
-
checkHTTPCompression
public static java.io.InputStream checkHTTPCompression (java.util.Map<java.lang.String,java.util.List<java.lang.String>> httpHeaders, java.io.InputStream is) throws java.io.IOException
This method will check whether the HTTP Header returned by a website has been encoded using compression. It expects the java.util.Map that is returned from an invocation of HttpURLConnection.getHeaderFields().
- Parameters:
httpHeaders - This is simply a java.util.Map<String, List<String>>. It must be the exact map that is returned by the HttpURLConnection.
is - This should be the InputStream that is returned from the HttpURLConnection when requesting the content from the web-server that is hosting the URL. The HTTP Headers will be searched, and if a compression algorithm has been specified (and the algorithm is one of the algorithms automatically handled by Java), then this InputStream shall be wrapped by the appropriate decompression algorithm.
- Returns: If this map contains a property named "Content-Encoding" AND this property has a property-value in its list equal to either "deflate" or "gzip", then this shall return a wrapped InputStream that is capable of handling the decompression algorithm. Otherwise, the original InputStream is returned unchanged.
NOTE: Since HTTP Headers are considered CASE INSENSITIVE, all String comparisons done in this method shall ignore case.
- Throws:
java.io.IOException
- Code:
- Exact Method Body:
// NOTE: HTTP Headers are CASE-INSENSITIVE, so a loop is needed to check if
// certain values are present - rather than the (more simple) Map.containsKey(...)
for (String prop : httpHeaders.keySet())

    // Check (Case Insensitive) if the HTTP Headers Map has the property "Content-Encoding"
    // NOTE: The returned Maps have been known to contain null keys, so check for that here
    if ((prop != null) && prop.equalsIgnoreCase("Content-Encoding"))

        // Check (Case Insensitive) if any properties assigned to "Content-Encoding"
        // are "DEFLATE" or "GZIP" - then return the wrapped stream immediately.
        for (String vals : httpHeaders.get(prop))

            if (vals.equalsIgnoreCase("gzip"))         return new GZIPInputStream(is);
            else if (vals.equalsIgnoreCase("deflate")) return new ZipInputStream(is);

// Neither of the property-values "gzip" or "deflate" were found.
// Return the original input stream.
return is;
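Example (Usage Sketch):
A sketch of the decompression hand-off, wired into a raw HttpURLConnection. The URL is a placeholder; everything else is either standard java.net / java.io API or the method documented directly above.

import Torello.HTML.Scrape;
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class CompressionExample
{
    public static void main(String[] args) throws java.io.IOException
    {
        // Placeholder URL, for illustration only
        HttpURLConnection con = (HttpURLConnection)
            new URL("https://example.com/").openConnection();

        con.setRequestMethod("GET");

        // If "Content-Encoding: gzip" (or "deflate") came back, the stream is
        // wrapped in a decompressing stream; otherwise it is returned untouched.
        InputStream is = Scrape.checkHTTPCompression
            (con.getHeaderFields(), con.getInputStream());

        try (BufferedReader br = new BufferedReader(new InputStreamReader(is)))
            { System.out.println(br.readLine()); }
    }
}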
-
httpHeadersToString
public static java.lang.String httpHeadersToString (java.util.Map<java.lang.String,java.util.List<java.lang.String>> httpHeaders)
This method shall simply take as input a java.util.Map which contains the HTTP Header properties that must have been generated by a call to the method HttpURLConnection.getHeaderFields(). It will produce a Java String that lists these headers in a readable text format.
- Parameters:
httpHeaders - This parameter must be an instance of java.util.Map<String, List<String>>, and it should have been generated by a call to HttpURLConnection.getHeaderFields(). The property names and values contained by this Map will be iterated and printed to a returned java.lang.String.
- Returns: This shall return a printed version of the Map.
- Code:
- Exact Method Body:
StringBuilder sb = new StringBuilder();
int max = 0;

// To ensure that the output string is "aligned", check the length of each of the
// keys in the HTTP Header.
for (String key : httpHeaders.keySet())
    if (key.length() > max) max = key.length();

max += 5;

// Iterate all of the Properties that are included in the 'httpHeaders' parameter.
// It is important to note that the java "toString()" method for the List<String> that
// is used to store the Property-Values list works great, without any changes.
for (String key : httpHeaders.keySet())
    sb.append(
        StringParse.rightSpacePad(key + ':', max) +
        httpHeaders.get(key).toString() + '\n'
    );

return sb.toString();
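Example (Usage Sketch):
A sketch that pretty-prints the response-headers of one request. The URL is a placeholder.

import Torello.HTML.Scrape;
import java.net.HttpURLConnection;
import java.net.URL;

public class HeaderDump
{
    public static void main(String[] args) throws java.io.IOException
    {
        // Placeholder URL, for illustration only
        HttpURLConnection con = (HttpURLConnection)
            new URL("https://example.com/").openConnection();

        // One aligned line per HTTP Header property
        System.out.println(Scrape.httpHeadersToString(con.getHeaderFields()));
    }
}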
-
openConn
public static java.io.BufferedReader openConn(java.lang.String url) throws java.io.IOException
- Code:
- Exact Method Body:
return openConn(new URL(url));
-
openConn
public static java.io.BufferedReader openConn(java.net.URL url) throws java.io.IOException
Opens a standard connection to a URL, and returns a BufferedReader for reading from it.
GZIP Compression:
It is uncommon to find compressed content on many of the major web-sites or web-hubs that are prevalent on the internet today. There are some where one may find a zipped page; for instance, Amazon occasionally returns zipped pages. This method will check whether one of the ZIP formats ('gzip' and 'deflate') has been specified in the HTTP Header that is returned. If it has, the '.html' file received will be decompressed first.
It should be of note that 'gzip' and 'deflate' are not the only compression algorithms that may be specified in an HTTP Header - there are others (tar-gzip, pack200, zStandard, etc...). If another compression algorithm is in use by a web-server for a specific URL, then manually selecting a decompression algorithm would be necessary.
Employing a User-Agent:
The inclusion of the "User Agent" field in this URL connection can be controlled from two public static fields at the top of this class. Being able to identify how one web-server will respond to a different "Browser User Agent" is well beyond the scope of these documentation notes, and is the subject of the "browser wars." Using them is not mandatory, and "which browser is being used" is all the 'USER_AGENT' field of an 'HttpURLConnection' even signifies.
- Parameters:
url - This may be an Internet URL.
- Returns: A java BufferedReader for retrieving the data from the internet connection.
- Throws:
java.io.IOException
- See Also: USER_AGENT, USE_USER_AGENT, checkHTTPCompression(Map, InputStream)
- Code:
- Exact Method Body:
HttpURLConnection con = (HttpURLConnection) url.openConnection();
con.setRequestMethod("GET");

if (USE_USER_AGENT) con.setRequestProperty("User-Agent", USER_AGENT);

InputStream is = checkHTTPCompression(con.getHeaderFields(), con.getInputStream());

return new BufferedReader(new InputStreamReader(is));
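Example (Usage Sketch):
A sketch that opens a standard connection and prints the page line-by-line. The URL is a placeholder.

import Torello.HTML.Scrape;
import java.io.BufferedReader;
import java.net.URL;

public class OpenConnExample
{
    public static void main(String[] args) throws java.io.IOException
    {
        // Placeholder URL, for illustration only
        BufferedReader br = Scrape.openConn(new URL("https://example.com/"));

        String line;
        while ((line = br.readLine()) != null) System.out.println(line);

        br.close();
    }
}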
-
openConnGetHeader
public static Ret2<java.io.BufferedReader,java.util.Map<java.lang.String,java.util.List<java.lang.String>>> openConnGetHeader (java.net.URL url) throws java.io.IOException
Opens a standard connection to a URL, and returns a BufferedReader for reading it, and also the HTTP Header that was returned by the HTTP Server.
GZIP Compression:
It is uncommon to find compressed content on many of the major web-sites or web-hubs that are prevalent on the internet today. There are some where one may find a zipped page; for instance, Amazon occasionally returns zipped pages. This method will check whether one of the ZIP formats ('gzip' and 'deflate') has been specified in the HTTP Header that is returned. If it has, the '.html' file received will be decompressed first.
It should be of note that 'gzip' and 'deflate' are not the only compression algorithms that may be specified in an HTTP Header - there are others (tar-gzip, pack200, zStandard, etc...). If another compression algorithm is in use by a web-server for a specific URL, then manually selecting a decompression algorithm would be necessary.
- Parameters:
url - This may be an Internet URL.
- Returns: This shall return an instance of class Ret2. The contents of the multiple return type are as follows:
Ret2.a (BufferedReader) - A BufferedReader that shall retrieve the HTTP Response from the URL provided to this method.
Ret2.b (java.util.Map) - An instance of Map<String, List<String>> which will contain the HTTP Headers that are returned by the HTTP Server associated with the URL provided to this method.
NOTE: This HTTP Header is obtained from the Java method HttpURLConnection.getHeaderFields().
- Throws:
java.io.IOException
- See Also: checkHTTPCompression(Map, InputStream)
- Code:
- Exact Method Body:
HttpURLConnection con = (HttpURLConnection) url.openConnection();
con.setRequestMethod("GET");

if (USE_USER_AGENT) con.setRequestProperty("User-Agent", USER_AGENT);

Map<String, List<String>> httpHeaders = con.getHeaderFields();
InputStream is = checkHTTPCompression(httpHeaders, con.getInputStream());

return new Ret2<BufferedReader, Map<String, List<String>>>
    (new BufferedReader(new InputStreamReader(is)), httpHeaders);
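Example (Usage Sketch):
A sketch that unpacks the Ret2 returned by this method. The URL is a placeholder, and the import location of class Ret2 (package Torello.Java) is an assumption about this library's layout; the fields Ret2.a and Ret2.b are used exactly as documented above.

import Torello.HTML.Scrape;
import Torello.Java.Ret2;   // Assumed package location for class Ret2

import java.io.BufferedReader;
import java.net.URL;
import java.util.List;
import java.util.Map;

public class GetHeaderExample
{
    public static void main(String[] args) throws java.io.IOException
    {
        // Placeholder URL, for illustration only
        Ret2<BufferedReader, Map<String, List<String>>> ret =
            Scrape.openConnGetHeader(new URL("https://example.com/"));

        // Ret2.b holds the HTTP Response-Headers Map
        System.out.println(Scrape.httpHeadersToString(ret.b));

        // Ret2.a holds the BufferedReader for the page content
        System.out.println(Scrape.scrapePage(ret.a));
    }
}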
-
openConn_iso_8859_1
public static java.io.BufferedReader openConn_iso_8859_1 (java.lang.String url) throws java.io.IOException
- Code:
- Exact Method Body:
return openConn_iso_8859_1(new URL(url));
-
openConn_iso_8859_1
public static java.io.BufferedReader openConn_iso_8859_1 (java.net.URL url) throws java.io.IOException
Will open an ISO-8859-1 connection to a URL, and returns a BufferedReader for reading it.
GZIP Compression:
It is uncommon to find compressed content on many of the major web-sites or web-hubs that are prevalent on the internet today. There are some where one may find a zipped page; for instance, Amazon occasionally returns zipped pages. This method will check whether one of the ZIP formats ('gzip' and 'deflate') has been specified in the HTTP Header that is returned. If it has, the '.html' file received will be decompressed first.
It should be of note that 'gzip' and 'deflate' are not the only compression algorithms that may be specified in an HTTP Header - there are others (tar-gzip, pack200, zStandard, etc...). If another compression algorithm is in use by a web-server for a specific URL, then manually selecting a decompression algorithm would be necessary.
Employing a User-Agent:
The inclusion of the "User Agent" field in this URL connection can be controlled from two public static fields at the top of this class. Being able to identify how one web-server will respond to a different "Browser User Agent" is well beyond the scope of these documentation notes, and is the subject of the "browser wars." Using them is not mandatory, and "which browser is being used" is all the 'USER_AGENT' field of an 'HttpURLConnection' even signifies.
- Parameters:
url - This may be an Internet URL. The site and page to which it points should return data encoded in the ISO-8859-1 charset.
- Returns: A java BufferedReader for retrieving the data from the internet connection.
- Throws:
java.io.IOException
- See Also: USER_AGENT, USE_USER_AGENT, checkHTTPCompression(Map, InputStream)
- Code:
- Exact Method Body:
HttpURLConnection con = (HttpURLConnection) url.openConnection();
con.setRequestMethod("GET");

if (USE_USER_AGENT) con.setRequestProperty("User-Agent", USER_AGENT);

con.setRequestProperty("Content-Type", "text/html; charset=iso-8859-1");

InputStream is = checkHTTPCompression(con.getHeaderFields(), con.getInputStream());

return new BufferedReader(new InputStreamReader(is, Charset.forName("iso-8859-1")));
-
openConnGetHeader_iso_8859_1
public static Ret2<java.io.BufferedReader,java.util.Map<java.lang.String,java.util.List<java.lang.String>>> openConnGetHeader_iso_8859_1 (java.net.URL url) throws java.io.IOException
Opens an ISO-8859-1 connection to a URL, and returns a BufferedReader for reading it, and also the HTTP Header that was returned by the HTTP Server.
GZIP Compression:
It is uncommon to find compressed content on many of the major web-sites or web-hubs that are prevalent on the internet today. There are some where one may find a zipped page; for instance, Amazon occasionally returns zipped pages. This method will check whether one of the ZIP formats ('gzip' and 'deflate') has been specified in the HTTP Header that is returned. If it has, the '.html' file received will be decompressed first.
It should be of note that 'gzip' and 'deflate' are not the only compression algorithms that may be specified in an HTTP Header - there are others (tar-gzip, pack200, zStandard, etc...). If another compression algorithm is in use by a web-server for a specific URL, then manually selecting a decompression algorithm would be necessary.
- Parameters:
url - This may be an Internet URL. The site and page to which it points should return data encoded in the ISO-8859-1 charset.
- Returns: This shall return an instance of class Ret2. The contents of the multiple return type are as follows:
Ret2.a (BufferedReader) - A BufferedReader that shall retrieve the HTTP Response from the URL provided to this method.
Ret2.b (java.util.Map) - An instance of Map<String, List<String>> which will contain the HTTP Headers that are returned by the HTTP Server associated with the URL provided to this method.
NOTE: This HTTP Header is obtained from the Java method HttpURLConnection.getHeaderFields().
- Throws:
java.io.IOException
- See Also: checkHTTPCompression(Map, InputStream)
- Code:
- Exact Method Body:
HttpURLConnection con = (HttpURLConnection) url.openConnection();
con.setRequestMethod("GET");

if (USE_USER_AGENT) con.setRequestProperty("User-Agent", USER_AGENT);

con.setRequestProperty("Content-Type", "charset=iso-8859-1");

Map<String, List<String>> httpHeaders = con.getHeaderFields();
InputStream is = checkHTTPCompression(httpHeaders, con.getInputStream());

// NOTE: The charset-name passed to Charset.forName must be "iso-8859-1" itself;
// the prefix "charset=" would not be a legal charset name.
return new Ret2<BufferedReader, Map<String, List<String>>>(
    new BufferedReader(new InputStreamReader(is, Charset.forName("iso-8859-1"))),
    httpHeaders
);
-
openConn_UTF8
public static java.io.BufferedReader openConn_UTF8(java.lang.String url) throws java.io.IOException
- Code:
- Exact Method Body:
return openConn_UTF8(new URL(url));
-
openConn_UTF8
public static java.io.BufferedReader openConn_UTF8(java.net.URL url) throws java.io.IOException
Opens a UTF8 connection to a URL, and returns a BufferedReader for reading it.
UTF-8 Character-Encoding:
For all intents and purposes, Java's internal class HttpURLConnection will handle any received UTF-8 content automatically. What this means, more or less, is that this method is largely unnecessary. It probably should be placed on the @Deprecated list; but just in case a bizarre or unforeseen situation arises where this method could be used as a reference, it shall remain here.
Please note that connecting to, or retrieving, '.html' content from a web-server that returns charset=UTF-8 is handled by the JRE with ease, since the Java primitive-type char is a 16-bit type. The methods openConn(String) and openConn(URL), etc... (without the "UTF8" appended to the method name) should suffice for making such connections. It should make no difference.
GZIP Compression:
It is uncommon to find compressed content on many of the major web-sites or web-hubs that are prevalent on the internet today. There are some where one may find a zipped page; for instance, Amazon occasionally returns zipped pages. This method will check whether one of the ZIP formats ('gzip' and 'deflate') has been specified in the HTTP Header that is returned. If it has, the '.html' file received will be decompressed first.
It should be of note that 'gzip' and 'deflate' are not the only compression algorithms that may be specified in an HTTP Header - there are others (tar-gzip, pack200, zStandard, etc...). If another compression algorithm is in use by a web-server for a specific URL, then manually selecting a decompression algorithm would be necessary.
Employing a User-Agent:
The inclusion of the "User Agent" field in this URL connection can be controlled from two public static fields at the top of this class. Being able to identify how one web-server will respond to a different "Browser User Agent" is well beyond the scope of these documentation notes, and is the subject of the "browser wars." Using them is not mandatory, and "which browser is being used" is all the 'USER_AGENT' field of an 'HttpURLConnection' even signifies.
- Parameters:
url - This may be an Internet URL. The site and page to which it points should return data encoded in the UTF-8 charset.
- Returns: A java BufferedReader for retrieving the data from the internet connection.
- Throws:
java.io.IOException
- See Also: USER_AGENT, USE_USER_AGENT, checkHTTPCompression(Map, InputStream)
- Code:
- Exact Method Body:
HttpURLConnection con = (HttpURLConnection) url.openConnection();
con.setRequestMethod("GET");

if (USE_USER_AGENT) con.setRequestProperty("User-Agent", USER_AGENT);

con.setRequestProperty("Content-Type", "charset=UTF-8");

InputStream is = checkHTTPCompression(con.getHeaderFields(), con.getInputStream());

return new BufferedReader(new InputStreamReader(is, Charset.forName("UTF-8")));
-
openConnGetHeader_UTF8
public static Ret2<java.io.BufferedReader,java.util.Map<java.lang.String,java.util.List<java.lang.String>>> openConnGetHeader_UTF8 (java.net.URL url) throws java.io.IOException
Opens a UTF8 connection to a URL, and returns a BufferedReader for reading it, and also the HTTP Header that was returned by the HTTP Server.
UTF-8 Character-Encoding:
For all intents and purposes, Java's internal class HttpURLConnection will handle any received UTF-8 content automatically. What this means, more or less, is that this method is largely unnecessary. It probably should be placed on the @Deprecated list; but just in case a bizarre or unforeseen situation arises where this method could be used as a reference, it shall remain here.
Please note that connecting to, or retrieving, '.html' content from a web-server that returns charset=UTF-8 is handled by the JRE with ease, since the Java primitive-type char is a 16-bit type. The methods openConn(String) and openConn(URL), etc... (without the "UTF8" appended to the method name) should suffice for making such connections. It should make no difference.
GZIP Compression:
It is uncommon to find compressed content on many of the major web-sites or web-hubs that are prevalent on the internet today. There are some where one may find a zipped page; for instance, Amazon occasionally returns zipped pages. This method will check whether one of the ZIP formats ('gzip' and 'deflate') has been specified in the HTTP Header that is returned. If it has, the '.html' file received will be decompressed first.
It should be of note that 'gzip' and 'deflate' are not the only compression algorithms that may be specified in an HTTP Header - there are others (tar-gzip, pack200, zStandard, etc...). If another compression algorithm is in use by a web-server for a specific URL, then manually selecting a decompression algorithm would be necessary.
- Parameters:
url - This may be an Internet URL. The site and page to which it points should return data encoded in the UTF-8 charset.
- Returns: This shall return an instance of class Ret2. The contents of the multiple return type are as follows:
Ret2.a (BufferedReader) - A BufferedReader that shall retrieve the HTTP Response from the URL provided to this method.
Ret2.b (java.util.Map) - An instance of Map<String, List<String>> which will contain the HTTP Headers that are returned by the HTTP Server associated with the URL provided to this method.
NOTE: This HTTP Header is obtained from the Java method HttpURLConnection.getHeaderFields().
- Throws:
java.io.IOException
- See Also: checkHTTPCompression(Map, InputStream)
- Code:
- Exact Method Body:
HttpURLConnection con = (HttpURLConnection) url.openConnection();
con.setRequestMethod("GET");

if (USE_USER_AGENT) con.setRequestProperty("User-Agent", USER_AGENT);

con.setRequestProperty("Content-Type", "charset=UTF-8");

Map<String, List<String>> httpHeaders = con.getHeaderFields();
InputStream is = checkHTTPCompression(httpHeaders, con.getInputStream());

return new Ret2<BufferedReader, Map<String, List<String>>>(
    new BufferedReader(new InputStreamReader(is, Charset.forName("UTF-8"))),
    httpHeaders
);
-
scrapePage
public static java.lang.String scrapePage(java.lang.String url) throws java.io.IOException
Convenience Method.
Invokes: scrapePage(BufferedReader)
Obtains: BufferedReader from openConn(String)
- Code:
- Exact Method Body:
return scrapePage(openConn(url));
-
scrapePage
public static java.lang.String scrapePage(java.net.URL url) throws java.io.IOException
- Code:
- Exact Method Body:
return scrapePage(openConn(url));
-
scrapePage
public static java.lang.String scrapePage(java.io.BufferedReader br) throws java.io.IOException
This scrapes a website and dumps the entire contents into a java.lang.String.
- Parameters:
br - This is a Reader that needs to have been connected to a website that will output text/html data.
- Returns: The text/html data, returned inside a String.
- Throws:
java.io.IOException
- Code:
- Exact Method Body:
StringBuffer sb = new StringBuffer();
String s;

while ((s = br.readLine()) != null) sb.append(s + "\n");

return sb.toString();
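Example (Usage Sketch):
A sketch that pairs this method with one of the openConn variants above. The URL is a placeholder.

import Torello.HTML.Scrape;
import java.io.BufferedReader;
import java.net.URL;

public class ScrapePageExample
{
    public static void main(String[] args) throws java.io.IOException
    {
        // Placeholder URL; a page known to be served as UTF-8 would fit here
        BufferedReader br = Scrape.openConn_UTF8(new URL("https://example.com/"));

        // Dump the complete text/html contents into a single String
        String html = Scrape.scrapePage(br);

        System.out.println("Read " + html.length() + " characters.");
    }
}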
-
scrapePageToVector
public static java.util.Vector<java.lang.String> scrapePageToVector (java.lang.String url, boolean includeNewLine) throws java.io.IOException
Convenience Method.
Invokes: scrapePageToVector(BufferedReader, boolean)
Obtains: BufferedReader from openConn(String)
- Code:
- Exact Method Body:
return scrapePageToVector(openConn(url), includeNewLine);
-
scrapePageToVector
public static java.util.Vector<java.lang.String> scrapePageToVector (java.net.URL url, boolean includeNewLine) throws java.io.IOException
Convenience Method.
Invokes: scrapePageToVector(BufferedReader, boolean)
Obtains: BufferedReader from openConn(URL)
- Code:
- Exact Method Body:
return scrapePageToVector(openConn(url), includeNewLine);
-
scrapePageToVector
public static java.util.Vector<java.lang.String> scrapePageToVector (java.io.BufferedReader br, boolean includeNewLine) throws java.io.IOException
This will scrape the entire contents of an HTML page to a Vector<String>. Each line of the text/HTML page is demarcated by the reception of a '\n' character from the web-server.
- Parameters:
br - This is the input source of the HTML page. It will be queried for String data.
includeNewLine - This will append the '\n' character to the end of each String in the Vector.
- Returns: A Vector of String's, where each String is a line on the web-page.
- Throws:
java.io.IOException
- See Also: scrapePageToVector(String, boolean)
- Code:
- Exact Method Body:
Vector<String> ret = new Vector<>();
String s = null;

if (includeNewLine)
    while ((s = br.readLine()) != null) ret.add(s + '\n');
else
    while ((s = br.readLine()) != null) ret.add(s);

return ret;
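Example (Usage Sketch):
A sketch of the line-by-line variant. The URL is a placeholder; passing 'false' leaves the '\n' characters off of each line.

import Torello.HTML.Scrape;
import java.util.Vector;

public class VectorExample
{
    public static void main(String[] args) throws java.io.IOException
    {
        // Placeholder URL, for illustration only
        Vector<String> lines =
            Scrape.scrapePageToVector("https://example.com/", false);

        System.out.println("The page contained " + lines.size() + " lines.");
    }
}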
-
getHTML
public static java.lang.StringBuffer getHTML(java.io.BufferedReader br, java.lang.String startTag, java.lang.String endTag) throws java.io.IOException
This receives an input stream that contains a pipe to a website that will produce HTML. The HTML is read from the website and returned inside a StringBuffer. This is called "scraping HTML."
- Parameters:
startTag - If this is null, the scrape will begin with the first character received. If this contains a String, the scrape will not include any text/HTML data that occurs prior to the first occurrence of 'startTag'.
endTag - If this is null, the scrape will read the entire contents of text/HTML data from the BufferedReader 'br' parameter. If this contains a String, then data will be read and included in the result until 'endTag' is received.
- Returns: A StringBuffer that is the text/html data retrieved from the Reader. Call toString() on the return value to retrieve that String.
- Throws:
ScrapeException - If, after the download completes, either 'startTag' or 'endTag' represents a String that was not found within the downloaded page, this exception is thrown.
java.io.IOException
- Code:
- Exact Method Body:
StringBuffer html = new StringBuffer();
String s;

// Nice Long Name... Guess what it means
boolean alreadyFoundEndTagInStartTagLine = false;

// If the startTag parameter is not null, skip all content, until the startTag is found!
if (startTag != null)
{
    boolean foundStartTag = false;

    while ((s = br.readLine()) != null)

        if (s.contains(startTag))
        {
            int startTagPos = s.indexOf(startTag);
            foundStartTag = true;

            // NOTE: Sometimes the 'startTag' and 'endTag' are on the same line!
            // This happens, for instance, on Yahoo Photos, when giant lines
            // (no line-breaks) are transmitted.
            // Hence... *really* long variable name, this is confusing!

            s = s.substring(startTagPos);

            if ((endTag != null) && s.contains(endTag))
            {
                s = s.substring(0, s.indexOf(endTag) + endTag.length());
                alreadyFoundEndTagInStartTagLine = true;
            }

            html.append(s + "\n");
            break;
        }

    if (! foundStartTag) throw new ScrapeException
        ("Start Tag: '" + startTag + "' was Not Found on Page.");
}

// if the endTag parameter is not null, stop reading as soon as the end-tag is found
if (endTag != null)
{
    // NOTE: This 'if' is inside curly-braces, because there is an 'else' that
    // "goes with" the 'if' above... BUT NOT the following 'if'

    if (! alreadyFoundEndTagInStartTagLine)
    {
        boolean foundEndTag = false;

        while ((s = br.readLine()) != null)

            if (s.contains(endTag))
            {
                foundEndTag = true;
                int endTagPos = s.indexOf(endTag);

                html.append(s.substring(0, endTagPos + endTag.length()) + "\n");
                break;
            }

            else html.append(s + "\n");

        if (! foundEndTag) throw new ScrapeException
            ("End Tag: '" + endTag + "' was Not Found on Page.");
    }
}

// ELSE: (endTag *was* null) ... read all content until EOF ... or ... "EOWP" (end of web-page)
else
    while ((s = br.readLine()) != null)
        html.append(s + "\n");

return html;
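Example (Usage Sketch):
A sketch that clips everything outside of the page's <body> element. The URL and the tag choices are placeholders; note that a ScrapeException will be thrown if either tag never arrives.

import Torello.HTML.Scrape;
import java.io.BufferedReader;

public class TagRangeExample
{
    public static void main(String[] args) throws java.io.IOException
    {
        // Placeholder URL, for illustration only
        BufferedReader br = Scrape.openConn("https://example.com/");

        // Keep only the content between the opening and closing <body> tags
        StringBuffer body = Scrape.getHTML(br, "<body", "</body>");

        System.out.println(body.toString());
    }
}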
-
getHTML
public static java.lang.StringBuffer getHTML(java.io.BufferedReader br, int startLineNum, int endLineNum) throws java.io.IOException
This receives an input stream that contains a pipe to a website that will produce HTML. The HTML is read from the website and returned inside a StringBuffer. This is called "scraping HTML."
- Parameters:
startLineNum - If this is '0' or '1', the scrape will begin with the first character received. If this contains a positive integer, the scrape will not include any text/HTML data that occurs prior to 'startLineNum' lines of text/html having been received.
endLineNum - If this is negative, the scrape will read the entire contents of text/HTML data from the BufferedReader 'br' parameter (until EOF is encountered). If this contains a positive integer, then data will be read and included in the result until 'endLineNum' lines of text/html have been received.
- Returns: A StringBuffer that is the text/html data retrieved from the Reader. Call toString() on the return value to retrieve that String.
- Throws:
java.lang.IllegalArgumentException - If parameter 'startLineNum' is negative, or greater than 'endLineNum'. If 'endLineNum' was negative, this test is skipped.
ScrapeException - If there were not enough lines read from the BufferedReader parameter to be consistent with the values in 'startLineNum' and 'endLineNum'.
java.io.IOException
- Code:
- Exact Method Body:
StringBuffer html = new StringBuffer();
String s = "";

// NOTE: Arrays start at 0, **BUT** HTML page line counts start at 1!
int curLineNum = 1;

if (startLineNum < 0) throw new IllegalArgumentException(
    "The parameter startLineNum is negative: " + startLineNum + " but this is not " +
    "allowed."
);

if (endLineNum == 0) throw new IllegalArgumentException
    ("The parameter endLineNum is zero, but this is not allowed.");

endLineNum   = (endLineNum < 0)    ? 1 : endLineNum;
startLineNum = (startLineNum == 0) ? 1 : startLineNum;

if ((endLineNum < startLineNum) && (endLineNum != 1)) throw new IllegalArgumentException(
    "The parameter startLineNum is: " + startLineNum + "\n" +
    "The parameter endLineNum is: " + endLineNum + "\n" +
    "It is required that the latter is larger than the former, " +
    "or it must be 0 or negative to signify read until EOF."
);

if (startLineNum > 1)
{
    while (curLineNum++ < startLineNum)

        if (br.readLine() == null) throw new ScrapeException(
            "The HTML Page that was given didn't even have enough lines to read " +
            "quantity in variable startLineNum.\nstartLineNum = " + startLineNum +
            " and read " + (curLineNum-1) + " line(s) before EOF."
        );

    // Off-By-One computer science error correction - remember, post-increment means the
    // last loop iteration didn't read a line, but did increment the loop counter!
    curLineNum--;
}

// endLineNum==1 means/implies that we don't have to heed the
// endLineNum variable ==> read to EOF/null!
if (endLineNum == 1)

    while ((s = br.readLine()) != null)
        html.append(s + "\n");

// endLineNum > 1 ==> Heed the endLineNum variable!
else
{
    for ( ; curLineNum <= endLineNum; curLineNum++)

        if ((s = br.readLine()) != null) html.append(s + "\n");
        else                             break;

    // NOTE: curLineNum-1 and endLineNum+1 are used because:
    //
    // ** The loop counter (curLineNum) breaks when the next line to read is the one
    //    past the endLineNum
    // ** endLineNum+1 is the appropriate state if enough lines were read from the
    //    HTML Page
    // ** curLineNum-1 is the number of the last line read from the HTML

    if (curLineNum != (endLineNum+1)) throw new ScrapeException(
        "The HTML Page that was read didn't have enough lines to read to quantity in " +
        "variable endLineNum.\nendLineNum = " + endLineNum + " but only read " +
        (curLineNum-1) + " line(s) before EOF."
    );
}

return html;
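Example (Usage Sketch):
A sketch of the range-limited variant. The URL and line numbers are placeholders; a ScrapeException is thrown if the page runs out of lines before line 50 arrives.

import Torello.HTML.Scrape;
import java.io.BufferedReader;

public class LineRangeExample
{
    public static void main(String[] args) throws java.io.IOException
    {
        // Placeholder URL, for illustration only
        BufferedReader br = Scrape.openConn("https://example.com/");

        // Keep only lines 1 through 50; a negative endLineNum would
        // mean "read until EOF" instead
        StringBuffer firstFifty = Scrape.getHTML(br, 1, 50);

        System.out.println(firstFifty.toString());
    }
}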
-