Package Torello.HTML

Class Scrape


  • public class Scrape
    extends java.lang.Object
Some standard utilities for transferring & downloading HTML from web-sites and then storing that content in memory as a Java String - which, subsequently, can be written to disk, transferred elsewhere, or even parsed (using class HTMLPage).

    This class just simplifies some of the typing for common Java Network Connection / HTTP Connection code.

    The openConn(args) methods open different types of connections to web-servers.

    NOTE: It is important to understand the major differences between these connection types. A plain-ASCII connection will drop all "higher order" UTF-8 characters (many of which are foreign-language characters). Often a standard Java web-connection method will suffice, but not always. If a website does not serve any code-points above ASCII 255, then the usual BufferedReader connection is just fine. UTF-8, however, is a very commonly used character-set on the internet. It includes everything from Spanish accent characters to many Chinese Mandarin characters, along with the tens of thousands of other characters listed in the UTF-8 specification.

    I was forced to use the iso_8859_1 version once for a site from Spain hosting the famous book by Cervantes, although I am not completely certain how this standard works - and have only needed this connection type twice. UTF-8, on the other hand, is used on roughly 70% of the websites that I have parsed.
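The practical difference between these charsets can be seen in a small, JDK-only sketch (this does not use the Scrape class itself; the sample string is purely illustrative):

```java
import java.nio.charset.StandardCharsets;

public class CharsetDemo
{
    public static void main(String[] args)
    {
        // "año" contains 'ñ' (U+00F1), which is above ASCII 127 and is
        // encoded by UTF-8 as two bytes.
        String original  = "año";
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        // Decoding with the right charset round-trips the text.
        System.out.println(new String(utf8Bytes, StandardCharsets.UTF_8));       // año

        // Decoding the same bytes as ISO-8859-1 mangles the accented character
        // - which is what happens when the wrong openConn variant is used.
        System.out.println(new String(utf8Bytes, StandardCharsets.ISO_8859_1));  // aÃ±o
    }
}
```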


Stateless Class: This class neither contains any program-state, nor can it be instantiated. The @StaticFunctional Annotation may also be called 'The Spaghetti Report'. Static-Functional classes are, essentially, C-styled files, without any constructors or non-static member fields. It is very similar to the Java-Bean @Stateless Annotation.
  • 1 Constructor(s), 1 declared private, zero-argument constructor
  • 21 Method(s), 21 declared static
  • 2 Field(s), 2 declared static, 0 declared final
  • Fields excused from final modifier (with explanation):
    Field 'USER_AGENT' is not final. Reason: CONFIGURATION
    Field 'USE_USER_AGENT' is not final. Reason: FLAG


    • Method Summary

       
      Open HTTP Connection, Get Reader
      Modifier and Type Method
      static BufferedReader openConn​(String url)
      static BufferedReader openConn​(URL url)
      static BufferedReader openConn_iso_8859_1​(String url)
      static BufferedReader openConn_iso_8859_1​(URL url)
      static BufferedReader openConn_UTF8​(String url)
      static BufferedReader openConn_UTF8​(URL url)
       
      Open HTTP Connection, Get Reader & Headers
      Modifier and Type Method
      static Ret2<BufferedReader, Map<String, List<String>>> openConnGetHeader​(URL url)
      static Ret2<BufferedReader, Map<String, List<String>>> openConnGetHeader_iso_8859_1​(URL url)
      static Ret2<BufferedReader, Map<String, List<String>>> openConnGetHeader_UTF8​(URL url)
       
      Read / Scrape Contents to String
      Modifier and Type Method
      static String scrapePage​(BufferedReader br)
      static String scrapePage​(String url)
      static String scrapePage​(URL url)
       
      Read / Scrape Contents to Vector<String>
      Modifier and Type Method
      static Vector<String> scrapePageToVector​(BufferedReader br, boolean includeNewLine)
      static Vector<String> scrapePageToVector​(String url, boolean includeNewLine)
      static Vector<String> scrapePageToVector​(URL url, boolean includeNewLine)
       
      Read / Scrape Contents to StringBuffer, Range-Limited
      Modifier and Type Method
      static StringBuffer getHTML​(BufferedReader br, int startLineNum, int endLineNum)
      static StringBuffer getHTML​(BufferedReader br, String startTag, String endTag)
       
      HTTP Header Methods
      Modifier and Type Method
      static InputStream checkHTTPCompression​(Map<String,​List<String>> httpHeaders, InputStream is)
      static String httpHeadersToString​(Map<String,​List<String>> httpHeaders)
      static boolean usesDeflate​(Map<String,​List<String>> httpHeaders)
      static boolean usesGZIP​(Map<String,​List<String>> httpHeaders)
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • USER_AGENT

        public static java.lang.String USER_AGENT
        When opening an HTTP URL connection, it is usually a good idea to use a "User Agent". The default behavior in this Scrape & Search Package is to connect using the public static String USER_AGENT = "Chrome/61.0.3163.100";

        NOTE: This behavior may be changed by modifying these public static variables.

        ALSO: If the boolean USE_USER_AGENT is set to FALSE, then no User-Agent will be used at all.
        Code:
        Exact Field Declaration Expression:
        public static String USER_AGENT = "Chrome/61.0.3163.100";
        
      • USE_USER_AGENT

        public static boolean USE_USER_AGENT
        When opening an HTTP URL connection, it is usually a good idea to use a "User Agent". The default behavior in this Scrape & Search Package is to connect using the public static String USER_AGENT = "Chrome/61.0.3163.100";

        NOTE: This behavior may be changed by modifying these public static variables.

        ALSO: If the boolean USE_USER_AGENT is set to FALSE, then no User-Agent will be used at all.
        Code:
        Exact Field Declaration Expression:
        public static boolean USE_USER_AGENT = true;
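Setting a User-Agent request property can be verified offline, because HttpURLConnection performs no network traffic until connect() or getInputStream() is invoked. The sketch below mirrors what openConn(URL) does with these two fields; the URL is illustrative only:

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class UserAgentDemo
{
    public static void main(String[] args) throws Exception
    {
        // openConnection() only builds the request object; no bytes are
        // sent until connect() or getInputStream() is called.
        HttpURLConnection con =
            (HttpURLConnection) new URL("http://example.com/").openConnection();

        con.setRequestMethod("GET");

        // Mirror of openConn(URL): the property is set only when the flag is on.
        boolean useUserAgent = true;
        String  userAgent    = "Chrome/61.0.3163.100";

        if (useUserAgent) con.setRequestProperty("User-Agent", userAgent);

        System.out.println(con.getRequestProperty("User-Agent"));  // Chrome/61.0.3163.100
    }
}
```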
        
    • Method Detail

      • usesGZIP

        public static boolean usesGZIP​
                    (java.util.Map<java.lang.String,​java.util.List<java.lang.String>> httpHeaders)
        
        This method will check whether the HTTP Header returned by a website indicates that the content has been encoded with the GZIP compression encoding. It expects the java.util.Map that is returned by an invocation of HttpURLConnection.getHeaderFields().
        Parameters:
        httpHeaders - This is simply a java.util.Map<String, List<String>>. It must be the exact Map that is returned by the HttpURLConnection.
        Returns:
        If this map contains a property named "Content-Encoding" AND that property has a value in its list equal to "gzip", then this method will return TRUE. Otherwise this method will return FALSE.

        NOTE: Since HTTP Headers are considered CASE INSENSITIVE, all String comparisons done in this method shall ignore case.
        Code:
        Exact Method Body:
         // NOTE: HTTP Headers are CASE-INSENSITIVE, so a loop is needed to check if
         //       certain values are present - rather than the (more simple) Map.containsKey(...)
        
         for (String prop : httpHeaders.keySet())
        
             // Check (Case Insensitive) if the HTTP Headers Map has the property "Content-Encoding"
             // NOTE: The Map's returned have been known to contain null keys, so check for that here.
        
             if ((prop != null) && prop.equalsIgnoreCase("Content-Encoding"))
        
                 // Check (Case Insensitive), if any of the properties assigned to "Content-Encoding"
                 // is "GZIP".  If this is found, return TRUE immediately.
            
                 for (String vals : httpHeaders.get(prop))
                     if (vals.equalsIgnoreCase("gzip")) return true;
        
         // The property-value "GZIP" wasn't found, so return FALSE.
         return false;
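The reason a loop is needed (rather than Map.containsKey) can be reproduced offline. The header Map below is hand-built, JDK-only, and reimplements the scan for illustration - it does not call usesGZIP itself:

```java
import java.util.*;

public class HeaderScanDemo
{
    // Same scan as usesGZIP: case-insensitive key match, null-safe.
    static boolean hasContentEncoding(Map<String, List<String>> headers, String algorithm)
    {
        for (String prop : headers.keySet())
            if ((prop != null) && prop.equalsIgnoreCase("Content-Encoding"))
                for (String val : headers.get(prop))
                    if (val.equalsIgnoreCase(algorithm)) return true;
        return false;
    }

    public static void main(String[] args)
    {
        Map<String, List<String>> headers = new HashMap<>();

        // getHeaderFields() maps the HTTP status line to a null key.
        headers.put(null, Arrays.asList("HTTP/1.1 200 OK"));
        headers.put("content-encoding", Arrays.asList("GZIP"));

        // An exact-case lookup misses the lower-cased key ...
        System.out.println(headers.containsKey("Content-Encoding"));  // false

        // ... while the case-insensitive scan finds it.
        System.out.println(hasContentEncoding(headers, "gzip"));      // true
    }
}
```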
        
      • usesDeflate

        public static boolean usesDeflate​
                    (java.util.Map<java.lang.String,​java.util.List<java.lang.String>> httpHeaders)
        
        This method will check whether the HTTP Header returned by a website indicates that the content has been encoded with the ZIP Compression (PKZIP, Deflate) encoding. It expects the java.util.Map that is returned by an invocation of HttpURLConnection.getHeaderFields().
        Parameters:
        httpHeaders - This is simply a java.util.Map<String, List<String>>. It must be the exact Map that is returned by the HttpURLConnection.
        Returns:
        If this map contains a property named "Content-Encoding" AND that property has a value in its list equal to "deflate", then this method will return TRUE. Otherwise this method will return FALSE.

        NOTE: Since HTTP Headers are considered CASE INSENSITIVE, all String comparisons done in this method shall ignore case.
        Code:
        Exact Method Body:
         // NOTE: HTTP Headers are CASE-INSENSITIVE, so a loop is needed to check if
         //       certain values are present - rather than the (more simple) Map.containsKey(...)
        
         for (String prop : httpHeaders.keySet())
        
             // Check (Case Insensitive) if the HTTP Headers Map has the property "Content-Encoding"
             // NOTE: The Map's returned have been known to contain null keys, so check for that here.
        
             if ((prop != null) && prop.equalsIgnoreCase("Content-Encoding"))
        
                 // Check (Case Insensitive), if any of the properties assigned to "Content-Encoding"
                 // is "DEFLATE".  If this is found, return TRUE immediately.
            
                 for (String vals : httpHeaders.get(prop))
                     if (vals.equalsIgnoreCase("deflate")) return true;
        
         // The property-value "deflate" wasn't found, so return FALSE.
         return false;
        
      • checkHTTPCompression

        public static java.io.InputStream checkHTTPCompression​
                    (java.util.Map<java.lang.String,​java.util.List<java.lang.String>> httpHeaders,
                     java.io.InputStream is)
                throws java.io.IOException
        
        This method will check whether the HTTP Header returned by a website indicates that the content has been compressed. It expects the java.util.Map that is returned by an invocation of HttpURLConnection.getHeaderFields().
        Parameters:
        httpHeaders - This is simply a java.util.Map<String, List<String>>. It must be the exact Map that is returned by the HttpURLConnection.
        is - This should be the InputStream that is returned by the HttpURLConnection when requesting content from the web-server hosting the URL. The HTTP Headers will be searched, and if a compression algorithm has been specified (and it is one of the algorithms automatically handled by Java), then this InputStream shall be wrapped by the appropriate decompression stream.
        Returns:
        If this map contains a property named "Content-Encoding" AND that property has a value in its list equal to either "deflate" or "gzip", then this method shall return a wrapped InputStream that is capable of handling the decompression algorithm. Otherwise, the original InputStream is returned unchanged.

        NOTE: Since HTTP Headers are considered CASE INSENSITIVE, all String comparisons done in this method shall ignore case.
        Throws:
        java.io.IOException
        Code:
        Exact Method Body:
         // NOTE: HTTP Headers are CASE-INSENSITIVE, so a loop is needed to check if
         //       certain values are present - rather than the (more simple) Map.containsKey(...)
        
         for (String prop : httpHeaders.keySet())
        
             // Check (Case Insensitive) if the HTTP Headers Map has the property "Content-Encoding"
             // NOTE: The Map's returned have been known to contain null keys, so check for that here.
        
             if ((prop != null) && prop.equalsIgnoreCase("Content-Encoding"))
        
                 // Check (Case Insensitive), if any of the properties assigned to "Content-Encoding"
                 // is "DEFLATE" or "GZIP".  If so, return the compression-algorithm immediately.
            
                 for (String vals : httpHeaders.get(prop))
        
                     if (vals.equalsIgnoreCase("gzip"))          return new GZIPInputStream(is);
                     else if (vals.equalsIgnoreCase("deflate"))  return new ZipInputStream(is);
        
         // Neither of the property-values "gzip" or "deflate" were found.
         // Return the original input stream.
         return is;
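The stream-wrapping decision made by checkHTTPCompression can be reproduced offline with the JDK's own zip classes. The payload and header Map below are hand-built stand-ins for a real response, and the case-insensitive scan is simplified to an exact lookup for brevity:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;
import java.util.zip.*;

public class CompressionDemo
{
    public static void main(String[] args) throws IOException
    {
        // Build a gzip-compressed payload, standing in for a compressed HTTP body.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos))
            { gz.write("<html>hello</html>".getBytes(StandardCharsets.UTF_8)); }

        // Headers as HttpURLConnection.getHeaderFields() would report them.
        Map<String, List<String>> headers = new HashMap<>();
        headers.put("Content-Encoding", Arrays.asList("gzip"));

        InputStream is = new ByteArrayInputStream(bos.toByteArray());

        // Same decision checkHTTPCompression makes: wrap when gzip is declared.
        if (headers.get("Content-Encoding").contains("gzip"))
            is = new GZIPInputStream(is);

        String body = new String(is.readAllBytes(), StandardCharsets.UTF_8);
        System.out.println(body);  // <html>hello</html>
    }
}
```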
        
      • httpHeadersToString

        public static java.lang.String httpHeadersToString​
                    (java.util.Map<java.lang.String,​java.util.List<java.lang.String>> httpHeaders)
        
        This method shall simply take as input a java.util.Map which contains the HTTP Header properties that must have been generated by a call to the method HttpURLConnection.getHeaderFields(). It will produce a Java String that lists these headers in text / readable format.
        Parameters:
        httpHeaders - This parameter must be an instance of java.util.Map<String, List<String>> and it should have been generated by a call to HttpURLConnection.getHeaderFields(). The property names and values contained by this Map will be iterated and printed to a returned java.lang.String.
        Returns:
        This shall return a printed version of the Map.
        Code:
        Exact Method Body:
         StringBuilder   sb  = new StringBuilder();
         int             max = 0;
        
         // To ensure that the output string is "aligned", check the length of each of the
         // keys in the HTTP Header.
        
         for (String key : httpHeaders.keySet()) if (key.length() > max) max = key.length();
        
         max += 5;
        
         // Iterate all of the Properties that are included in the 'httpHeaders' parameter
         // It is important to note that the java "toString()" method for the List<String> that
         // is used to store the Property-Values list works great, without any changes.
        
         for (String key : httpHeaders.keySet()) sb.append(
             StringParse.rightSpacePad(key + ':', max) +
             httpHeaders.get(key).toString() + '\n'
         );
        
         return sb.toString();
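The padding step above (longest key plus five) can be sketched with plain JDK calls. Here the library's StringParse.rightSpacePad is replaced with String.format for self-containment, and the header values are made up:

```java
import java.util.*;

public class HeaderPrintDemo
{
    public static void main(String[] args)
    {
        Map<String, List<String>> headers = new LinkedHashMap<>();
        headers.put("Content-Type",     Arrays.asList("text/html; charset=UTF-8"));
        headers.put("Content-Encoding", Arrays.asList("gzip"));

        // Find the longest key so every value column starts at the same offset.
        int max = 0;
        for (String key : headers.keySet()) if (key.length() > max) max = key.length();
        max += 5;

        // List.toString() already renders the value list readably.
        StringBuilder sb = new StringBuilder();
        for (String key : headers.keySet())
            sb.append(String.format("%-" + max + "s", key + ':'))
              .append(headers.get(key))
              .append('\n');

        System.out.print(sb);
    }
}
```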
        
      • openConn

        public static java.io.BufferedReader openConn​(java.lang.String url)
                                               throws java.io.IOException
        Convenience Method
        Invokes: openConn(URL)
        Code:
        Exact Method Body:
         return openConn(new URL(url));
        
      • openConn

        public static java.io.BufferedReader openConn​(java.net.URL url)
                                               throws java.io.IOException
        Opens a standard connection to a URL, and returns a BufferedReader for reading from it.

        GZIP Note: It is uncommon to find compressed content on many of the major web-sites or web-hubs that are prevalent on the internet today. There are some where one may find a zipped page; Amazon, for instance, occasionally returns zipped pages. This method will check whether one of the ZIP formats ('gzip' or 'deflate') has been specified in the HTTP Header that is returned. If so, the '.html' file received will be decompressed first.

        Note that 'gzip' and 'deflate' are not the only compression algorithms that may be specified in an HTTP Header - there are others (tar-gzip, pack200, zStandard, etc...). If another compression algorithm is in use by a web-server for a specific URL, then manually selecting a decompression algorithm will be necessary.

        NOTE: The inclusion of the "User Agent" field in this URL Connection can be controlled via two public static fields at the top of this class. How a given web-server will respond to a different "Browser User Agent" is well beyond the scope of these documentation notes, and is the subject of the "browser wars." Using a User-Agent is not mandatory; "which browser is being used" is all the 'USER_AGENT' field of an 'HttpURLConnection' signifies.
        Parameters:
        url - This may be an Internet-URL.
        Returns:
        A java BufferedReader for retrieving the data from the internet connection.
        Throws:
        java.io.IOException
        See Also:
        USER_AGENT, USE_USER_AGENT, checkHTTPCompression(Map, InputStream)
        Code:
        Exact Method Body:
         HttpURLConnection con = (HttpURLConnection) url.openConnection();
        
         con.setRequestMethod("GET");
        
         if (USE_USER_AGENT) con.setRequestProperty("User-Agent", USER_AGENT);
        
         InputStream is = checkHTTPCompression(con.getHeaderFields(), con.getInputStream());
        
         return new BufferedReader(new InputStreamReader(is));
        
      • openConnGetHeader

        public static Ret2<java.io.BufferedReader,​java.util.Map<java.lang.String,​java.util.List<java.lang.String>>> openConnGetHeader​
                    (java.net.URL url)
                throws java.io.IOException
        
        Opens a standard connection to a URL, and returns a BufferedReader for reading from it, as well as the HTTP Header that was returned by the HTTP Server.

        GZIP Note: It is uncommon to find compressed content on many of the major web-sites or web-hubs that are prevalent on the internet today. There are some where one may find a zipped page; Amazon, for instance, occasionally returns zipped pages. This method will check whether one of the ZIP formats ('gzip' or 'deflate') has been specified in the HTTP Header that is returned. If so, the '.html' file received will be decompressed first.

        Note that 'gzip' and 'deflate' are not the only compression algorithms that may be specified in an HTTP Header - there are others (tar-gzip, pack200, zStandard, etc...). If another compression algorithm is in use by a web-server for a specific URL, then manually selecting a decompression algorithm will be necessary.
        Parameters:
        url - This may be an Internet URL.
        Returns:
        This shall return an instance of class Ret2. The contents of the multiple return type are as follows:

        • Ret2.a (BufferedReader)

          A BufferedReader that shall retrieve the HTTP Response from the URL provided to this method.

        • Ret2.b (java.util.Map)

          An instance of Map<String, List<String>> which will contain the HTTP Headers which are returned by the HTTP Server associated with the URL provided to this method.

          NOTE: This HTTP Header is obtained from the Java method HttpURLConnection.getHeaderFields()
        Throws:
        java.io.IOException
        See Also:
        checkHTTPCompression(Map, InputStream)
        Code:
        Exact Method Body:
         HttpURLConnection con = (HttpURLConnection) url.openConnection();
        
         con.setRequestMethod("GET");
        
         if (USE_USER_AGENT) con.setRequestProperty("User-Agent", USER_AGENT);
        
         Map<String, List<String>> httpHeaders = con.getHeaderFields();
        
         InputStream is = checkHTTPCompression(httpHeaders, con.getInputStream());
        
         return new Ret2<BufferedReader, Map<String, List<String>>>
             (new BufferedReader(new InputStreamReader(is)), httpHeaders);
        
      • openConn_iso_8859_1

        public static java.io.BufferedReader openConn_iso_8859_1​
                    (java.net.URL url)
                throws java.io.IOException
        
        Opens an ISO-8859-1 connection to a URL, and returns a BufferedReader for reading from it.

        GZIP Note: It is uncommon to find compressed content on many of the major web-sites or web-hubs that are prevalent on the internet today. There are some where one may find a zipped page; Amazon, for instance, occasionally returns zipped pages. This method will check whether one of the ZIP formats ('gzip' or 'deflate') has been specified in the HTTP Header that is returned. If so, the '.html' file received will be decompressed first.

        Note that 'gzip' and 'deflate' are not the only compression algorithms that may be specified in an HTTP Header - there are others (tar-gzip, pack200, zStandard, etc...). If another compression algorithm is in use by a web-server for a specific URL, then manually selecting a decompression algorithm will be necessary.

        NOTE: The inclusion of the "User Agent" field in this URL Connection can be controlled via two public static fields at the top of this class. How a given web-server will respond to a different "Browser User Agent" is well beyond the scope of these documentation notes, and is the subject of the "browser wars." Using a User-Agent is not mandatory; "which browser is being used" is all the 'USER_AGENT' field of an 'HttpURLConnection' signifies.
        Parameters:
        url - This may be an Internet URL. The site and page to which it points should return data encoded in the ISO-8859 charset.
        Returns:
        A java BufferedReader for retrieving the data from the internet connection.
        Throws:
        java.io.IOException
        See Also:
        USER_AGENT, USE_USER_AGENT, checkHTTPCompression(Map, InputStream)
        Code:
        Exact Method Body:
         HttpURLConnection con = (HttpURLConnection) url.openConnection();
        
         con.setRequestMethod("GET");
        
         if (USE_USER_AGENT) con.setRequestProperty("User-Agent", USER_AGENT);
        
         con.setRequestProperty("Content-Type", "text/html; charset=iso-8859-1");
        
         InputStream is = checkHTTPCompression(con.getHeaderFields(), con.getInputStream());
        
         return new BufferedReader(new InputStreamReader(is, Charset.forName("iso-8859-1")));
        
      • openConnGetHeader_iso_8859_1

        public static Ret2<java.io.BufferedReader,​java.util.Map<java.lang.String,​java.util.List<java.lang.String>>> openConnGetHeader_iso_8859_1​
                    (java.net.URL url)
                throws java.io.IOException
        
        Opens an ISO-8859-1 connection to a URL, and returns a BufferedReader for reading from it, as well as the HTTP Header that was returned by the HTTP Server.

        GZIP Note: It is uncommon to find compressed content on many of the major web-sites or web-hubs that are prevalent on the internet today. There are some where one may find a zipped page; Amazon, for instance, occasionally returns zipped pages. This method will check whether one of the ZIP formats ('gzip' or 'deflate') has been specified in the HTTP Header that is returned. If so, the '.html' file received will be decompressed first.

        Note that 'gzip' and 'deflate' are not the only compression algorithms that may be specified in an HTTP Header - there are others (tar-gzip, pack200, zStandard, etc...). If another compression algorithm is in use by a web-server for a specific URL, then manually selecting a decompression algorithm will be necessary.
        Parameters:
        url - This may be an Internet URL. The site and page to which it points should return data encoded in the ISO-8859-1 charset.
        Returns:
        This shall return an instance of class Ret2. The contents of the multiple return type are as follows:

        • Ret2.a (BufferedReader)

          A BufferedReader that shall retrieve the HTTP Response from the URL provided to this method.

        • Ret2.b (java.util.Map)

          An instance of Map<String, List<String>> which will contain the HTTP Headers which are returned by the HTTP Server associated with the URL provided to this method.

          NOTE: This HTTP Header is obtained from the Java method HttpURLConnection.getHeaderFields()
        Throws:
        java.io.IOException
        See Also:
        checkHTTPCompression(Map, InputStream)
        Code:
        Exact Method Body:
         HttpURLConnection con = (HttpURLConnection) url.openConnection();
        
         con.setRequestMethod("GET");
        
         if (USE_USER_AGENT) con.setRequestProperty("User-Agent", USER_AGENT);
        
         con.setRequestProperty("Content-Type", "charset=iso-8859-1");
        
         Map<String, List<String>> httpHeaders = con.getHeaderFields();
        
         InputStream is = checkHTTPCompression(httpHeaders, con.getInputStream());
        
         return new Ret2<BufferedReader, Map<String, List<String>>>(
             new BufferedReader(new InputStreamReader(is, Charset.forName("iso-8859-1"))),
             httpHeaders
         );
        
      • openConn_UTF8

        public static java.io.BufferedReader openConn_UTF8​(java.lang.String url)
                                                    throws java.io.IOException
        Convenience Method
        Invokes: openConn_UTF8(URL).
        Code:
        Exact Method Body:
         return openConn_UTF8(new URL(url));
        
      • openConn_UTF8

        public static java.io.BufferedReader openConn_UTF8​(java.net.URL url)
                                                    throws java.io.IOException
        Opens a UTF8 connection to a URL, and returns a BufferedReader for reading it.

        UTF-8 NOTE: For all intents and purposes, Java's internal class HttpURLConnection will handle received UTF-8 content automatically. This means that the method you are looking at right now is largely unnecessary. It should probably be marked @Deprecated, but it remains here in case a bizarre or unforeseen situation arises where it could serve as a reference.

        Please note that connecting to, and retrieving, '.html' content from a web-server that returns charset=UTF-8 is handled by the JRE with ease, since the Java primitive-type char is a 16-bit type. The methods openConn(String) and openConn(URL), etc... (without "UTF8" appended to the method name) should suffice for making such connections; it should make no difference.

        GZIP Note: It is uncommon to find compressed content on many of the major web-sites or web-hubs that are prevalent on the internet today. There are some where one may find a zipped page; Amazon, for instance, occasionally returns zipped pages. This method will check whether one of the ZIP formats ('gzip' or 'deflate') has been specified in the HTTP Header that is returned. If so, the '.html' file received will be decompressed first.

        Note that 'gzip' and 'deflate' are not the only compression algorithms that may be specified in an HTTP Header - there are others (tar-gzip, pack200, zStandard, etc...). If another compression algorithm is in use by a web-server for a specific URL, then manually selecting a decompression algorithm will be necessary.

        NOTE: The inclusion of the "User Agent" field in this URL Connection can be controlled via two public static fields at the top of this class. How a given web-server will respond to a different "Browser User Agent" is well beyond the scope of these documentation notes, and is the subject of the "browser wars." Using a User-Agent is not mandatory; "which browser is being used" is all the 'USER_AGENT' field of an 'HttpURLConnection' signifies.
        Parameters:
        url - This may be an Internet URL. The site and page to which it points should return data encoded in the UTF-8 charset.
        Returns:
        A java BufferedReader for retrieving the data from the internet connection.
        Throws:
        java.io.IOException
        See Also:
        USER_AGENT, USE_USER_AGENT, checkHTTPCompression(Map, InputStream)
        Code:
        Exact Method Body:
         HttpURLConnection con = (HttpURLConnection) url.openConnection();
        
         con.setRequestMethod("GET");
        
         if (USE_USER_AGENT) con.setRequestProperty("User-Agent", USER_AGENT);
        
         con.setRequestProperty("Content-Type", "charset=UTF-8");
        
         InputStream is = checkHTTPCompression(con.getHeaderFields(), con.getInputStream());
        
         return new BufferedReader(new InputStreamReader(is, Charset.forName("UTF-8")));
        
      • openConnGetHeader_UTF8

        public static Ret2<java.io.BufferedReader,​java.util.Map<java.lang.String,​java.util.List<java.lang.String>>> openConnGetHeader_UTF8​
                    (java.net.URL url)
                throws java.io.IOException
        
        Opens a UTF8 connection to a URL, and returns a BufferedReader for reading it, and also the HTTP Header that was returned by the HTTP Server.

        UTF-8 NOTE: For all intents and purposes, Java's internal class HttpURLConnection will handle received UTF-8 content automatically. This means that the method you are looking at right now is largely unnecessary. It should probably be marked @Deprecated, but it remains here in case a bizarre or unforeseen situation arises where it could serve as a reference.

        Please note that connecting to, and retrieving, '.html' content from a web-server that returns charset=UTF-8 is handled by the JRE with ease, since the Java primitive-type char is a 16-bit type. The methods openConn(String) and openConn(URL), etc... (without "UTF8" appended to the method name) should suffice for making such connections; it should make no difference.

        GZIP Note: It is uncommon to find compressed content on many of the major web-sites or web-hubs that are prevalent on the internet today. There are some where one may find a zipped page; Amazon, for instance, occasionally returns zipped pages. This method will check whether one of the ZIP formats ('gzip' or 'deflate') has been specified in the HTTP Header that is returned. If so, the '.html' file received will be decompressed first.

        Note that 'gzip' and 'deflate' are not the only compression algorithms that may be specified in an HTTP Header - there are others (tar-gzip, pack200, zStandard, etc...). If another compression algorithm is in use by a web-server for a specific URL, then manually selecting a decompression algorithm will be necessary.
        Parameters:
        url - This may be an Internet URL. The site and page to which it points should return data encoded in the UTF-8 charset.
        Returns:
        This shall return an instance of class Ret2. The contents of the multiple return type are as follows:

        • Ret2.a (BufferedReader)

          A BufferedReader that shall retrieve the HTTP Response from the URL provided to this method.

        • Ret2.b (java.util.Map)

          An instance of Map<String, List<String>> which will contain the HTTP Headers which are returned by the HTTP Server associated with the URL provided to this method.

          NOTE: This HTTP Header is obtained from the Java method HttpURLConnection.getHeaderFields()
        Throws:
        java.io.IOException
        See Also:
        checkHTTPCompression(Map, InputStream)
        Code:
        Exact Method Body:
         HttpURLConnection con = (HttpURLConnection) url.openConnection();
        
         con.setRequestMethod("GET");
        
         if (USE_USER_AGENT) con.setRequestProperty("User-Agent", USER_AGENT);
        
         con.setRequestProperty("Content-Type", "charset=UTF-8");
        
         Map<String, List<String>> httpHeaders = con.getHeaderFields();
        
         InputStream is = checkHTTPCompression(httpHeaders, con.getInputStream());
        
         return new Ret2<BufferedReader, Map<String, List<String>>>(
             new BufferedReader(new InputStreamReader(is, Charset.forName("UTF-8"))),
             httpHeaders
         );
        
      • scrapePage

        🡅  🡇    
        public static java.lang.String scrapePage​(java.io.BufferedReader br)
                                           throws java.io.IOException
        This scrapes a website and dumps the entire contents into a java.lang.String.
        Parameters:
        br - A Reader that has already been connected to a website that will output text/html data.
        Returns:
        The text/html data - returned inside a String
        Throws:
        java.io.IOException
        Code:
        Exact Method Body:
         StringBuffer sb = new StringBuffer();
         String s;
        
         while ((s = br.readLine()) != null) sb.append(s + "\n");
        
         return sb.toString();
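The read-loop above can be exercised without a live web connection by reading from an in-memory StringReader. The following is a small standalone sketch of the same pattern (the class and method names are illustrative, not part of this library):

```java
import java.io.*;

public class ScrapeAllDemo
{
    // Mirrors the loop above: every line is appended, each followed by '\n'
    public static String readAll(BufferedReader br) throws IOException
    {
        StringBuilder sb = new StringBuilder();
        String s;

        while ((s = br.readLine()) != null) sb.append(s).append('\n');

        return sb.toString();
    }
}
```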
        
      • scrapePageToVector

        🡅  🡇    
        public static java.util.Vector<java.lang.String> scrapePageToVector​
                    (java.io.BufferedReader br,
                     boolean includeNewLine)
                throws java.io.IOException
        
        This will scrape the entire contents of an HTML page to a Vector<String>. Each line of the text/HTML page is demarcated by the reception of a '\n' character from the web-server.
        Parameters:
        br - This is the input source of the HTML page. Lines of String data will be read from it.
        includeNewLine - When TRUE, the '\n' character is appended to the end of each String in the Vector.
        Returns:
        a Vector of String's where each String is a line on the web-page.
        Throws:
        java.io.IOException
        See Also:
        scrapePageToVector(String, boolean)
        Code:
        Exact Method Body:
         Vector<String>  ret = new Vector<>();
         String          s   = null;
        
         if (includeNewLine)
        
             while ((s = br.readLine()) != null)
                 ret.add(s + '\n');
        
         else
        
             while ((s = br.readLine()) != null)
                 ret.add(s);
        
         return ret;
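The effect of the 'includeNewLine' branch can be seen in this condensed, standalone sketch of the same loop (the names here are illustrative, and the sketch reads from a StringReader rather than a web connection):

```java
import java.io.*;
import java.util.*;

public class ScrapeToVectorDemo
{
    // Condensed version of the loop above: one Vector element per line,
    // optionally retaining the '\n' terminator on each element.
    public static Vector<String> toVector(BufferedReader br, boolean includeNewLine)
        throws IOException
    {
        Vector<String> ret = new Vector<>();
        String s;

        while ((s = br.readLine()) != null)
            ret.add(includeNewLine ? (s + '\n') : s);

        return ret;
    }
}
```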
        
      • getHTML

        🡅  🡇    
        public static java.lang.StringBuffer getHTML​(java.io.BufferedReader br,
                                                     java.lang.String startTag,
                                                     java.lang.String endTag)
                                              throws java.io.IOException
        This receives a Reader that is connected to a website which will produce HTML. The HTML is read from the website, and returned as a String. This is called "scraping HTML."
        Parameters:
        br - A Reader connected to a website that will output text/html data.
        startTag - If this is null, the scrape will begin with the first character received. If this contains a String, the scrape will not include any text/HTML data that occurs prior to the first occurrence of 'startTag'
        endTag - If this is null, the scrape will read the entire contents of text/HTML data from the BufferedReader br parameter. If this contains a String, then data will be read and included in the result until 'endTag' is received.
        Returns:
        a StringBuffer containing the text/html data retrieved from the Reader. Call toString() on the return value to retrieve that String.
        Throws:
        ScrapeException - If, after download completes, either the 'startTag' or the parameter 'endTag' do not represent String's that were found within the downloaded page, this exception is thrown.
        java.io.IOException
        Code:
        Exact Method Body:
         StringBuffer    html                                = new StringBuffer();
         String          s;
         boolean         alreadyFoundEndTagInStartTagLine    = false;
        
         // If the startTag parameter is not null, skip all content, until the startTag is found!
         if (startTag != null)
         {
             boolean foundStartTag = false;
        
             while ((s = br.readLine()) != null)
        
                 if (s.contains(startTag))
                 {
                     int startTagPos = s.indexOf(startTag);
        
                     foundStartTag = true;
        
                     // NOTE:    Sometimes the 'startTag' and 'endTag' are on the same line!
                     //          This happens, for instance, on Yahoo Photos, when giant lines
                     //          (no line-breaks) are transmitted
                     //          Hence... *really* long variable name, this is confusing!
        
                     s = s.substring(startTagPos);
        
                     if ((endTag != null) && s.contains(endTag))
                     {
                         s = s.substring(0, s.indexOf(endTag) + endTag.length());
        
                         alreadyFoundEndTagInStartTagLine = true;
                     }
        
                     html.append(s + "\n"); break;
                 }
        
             if (! foundStartTag) throw new ScrapeException
                 ("Start Tag: '" + startTag + "' was Not Found on Page.");
         }
        
         // if the endTag parameter is not null, stop reading as soon as the end-tag is found
         if (endTag != null)
         {
             // NOTE: This 'if' is inside curly-braces, because there is an 'else' that "goes with"
             // the 'if' above... BUT NOT the following 'if'
        
             if (! alreadyFoundEndTagInStartTagLine)
             {
                 boolean foundEndTag = false;
        
                 while ((s = br.readLine()) != null)
        
                     if (s.contains(endTag))
                     {
                         foundEndTag = true;
                         int endTagPos = s.indexOf(endTag);
                         html.append(s.substring(0, endTagPos + endTag.length()) + "\n");
                         break;
                     }
        
                     else html.append(s + "\n");
        
                 if (! foundEndTag) throw new ScrapeException
                     ("End Tag: '" + endTag + "' was Not Found on Page.");
             }
         }
        
         // ELSE: (endTag *was* null) ... read all content until EOF ... or ... "EOWP" (end of web-page)
         else
        
             while ((s = br.readLine()) != null)
                 html.append(s + "\n");
        
         // Return the accumulated HTML content
         return html;
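The start-tag/end-tag clipping above - including the case where both tags land on the same (very long) line, which the 'alreadyFoundEndTagInStartTagLine' flag handles - can be illustrated with a condensed standalone sketch. The names below are illustrative, and the ScrapeException error-handling of the real method is omitted for brevity:

```java
import java.io.*;

public class TagClipDemo
{
    // Condensed sketch of the clipping logic: skip lines until startTag is seen,
    // then keep content up to and including endTag. Handles startTag and endTag
    // appearing on the same line.
    public static String clip(BufferedReader br, String startTag, String endTag)
        throws IOException
    {
        StringBuilder html = new StringBuilder();
        String s;
        boolean started = (startTag == null);

        while ((s = br.readLine()) != null)
        {
            if (! started)
            {
                if (! s.contains(startTag)) continue;   // still before the start-tag

                started = true;
                s = s.substring(s.indexOf(startTag));   // clip text before startTag
            }

            if ((endTag != null) && s.contains(endTag))
            {
                // Keep everything up to and including endTag, then stop reading
                html.append(s, 0, s.indexOf(endTag) + endTag.length()).append('\n');
                return html.toString();
            }

            html.append(s).append('\n');
        }

        return html.toString();
    }
}
```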
        
      • getHTML

        🡅    
        public static java.lang.StringBuffer getHTML​(java.io.BufferedReader br,
                                                     int startLineNum,
                                                     int endLineNum)
                                              throws java.io.IOException
        This receives a Reader that is connected to a website which will produce HTML. The HTML is read from the website, and returned as a String. This is called "scraping HTML."
        Parameters:
        br - A Reader connected to a website that will output text/html data.
        startLineNum - If this is '0' or '1', the scrape will begin with the first character received. If this contains a positive integer, the scrape will not include any text/HTML data that occurs prior to int startLineNum lines of text/html having been received.
        endLineNum - If this is negative, the scrape will read the entire contents of text/HTML data from the BufferedReader br parameter (until EOF is encountered). If this contains a positive integer, then data will be read and included in the result until int endLineNum lines of text/html have been received.
        Returns:
        a StringBuffer containing the text/html data retrieved from the Reader. Call toString() on the return value to retrieve that String.
        Throws:
        java.lang.IllegalArgumentException - If parameter 'startLineNum' is negative, if 'endLineNum' is zero, or if 'endLineNum' is positive but less than 'startLineNum'. If 'endLineNum' is negative, the latter test is skipped.
        ScrapeException - If there were not enough lines read from the BufferedReader parameter to be consistent with the values in 'startLineNum' and 'endLineNum'
        java.io.IOException
        Code:
        Exact Method Body:
         StringBuffer	html    = new StringBuffer();
         String			s       = "";
        
         // NOTE: Arrays start at 0, **BUT** HTML page line counts start at 1!
         int curLineNum = 1;
        
         if (startLineNum < 0) throw new IllegalArgumentException(
             "The parameter startLineNum is negative: " + startLineNum + " but this is not " +
             "allowed."
         );
        
         if (endLineNum == 0) throw new IllegalArgumentException
             ("The parameter endLineNum is zero, but this is not allowed.");
        
         endLineNum		= (endLineNum < 0) ? 1 : endLineNum;
         startLineNum	= (startLineNum == 0) ? 1 : startLineNum;
        
         if ((endLineNum < startLineNum) && (endLineNum != 1)) throw new IllegalArgumentException(
             "The parameter startLineNum is: " + startLineNum + "\n" +
             "The parameter endLineNum is: " + endLineNum + "\n" +
             "It is required that the latter is larger than the former, " +
             "or it must be 0 or negative to signify read until EOF."
         );
        
         if (startLineNum > 1)
         {
             while (curLineNum++ < startLineNum)
        
                 if (br.readLine() == null) throw new ScrapeException(
                     "The HTML Page that was given didn't even have enough lines to read " +
                     "quantity in variable startLineNum.\nstartLineNum = " + startLineNum + 
                     " and read " + (curLineNum-1) + " line(s) before EOF."
                 );
        
             // Off-By-One computer science error correction - remember: post-increment means the
             // last loop iteration didn't read a line, but did increment the loop counter!
        
             curLineNum--;
         }
        
         // endLineNum==1  means/implies that we don't have to heed the
         // endLineNum variable ==> read to EOF/null!
        
         if (endLineNum == 1)
        
             while ((s = br.readLine()) != null)
                 html.append(s + "\n");
        
         // endLineNum > 1 ==> Heed the endLineNum variable!
         else
         {
             // System.out.println("At START of LOOP: curLineNum = " + curLineNum +
             // " and endLineNum = " + endLineNum);
        
             for ( ;curLineNum <= endLineNum; curLineNum++)
        
                 if ((s = br.readLine()) != null) html.append(s + "\n");
                 else break;
        
             // NOTE: curLineNum-1 and endLineNum+1 are used because:
             //
             //		** The loop counter (curLineNum) breaks when the next line to read is the one
             //          passed the endLineNum
             //		** endLineNum+1 is the appropriate state if enough lines were read from the
             //           HTML Page
             //		** curLineNum-1 is the number of the last line read from the HTML
        
             if (curLineNum != (endLineNum+1)) throw new ScrapeException(
                 "The HTML Page that was read didn't have enough lines to read to quantity in " +
                 "variable endLineNum.\nendLineNum = " + endLineNum + " but only read " +
                 (curLineNum-1) + " line(s) before EOF."
             );
         }
        
         // Return the accumulated HTML content
         return html;
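The 1-based, inclusive line-range behavior described above can be summarized with a condensed standalone sketch (illustrative names; the IllegalArgumentException and ScrapeException validation of the real method is omitted for brevity):

```java
import java.io.*;

public class LineRangeDemo
{
    // Condensed sketch of the line-number scrape: skip lines before startLineNum,
    // keep lines through endLineNum (1-based, inclusive). A non-positive
    // endLineNum means "read until EOF."
    public static String lines(BufferedReader br, int startLineNum, int endLineNum)
        throws IOException
    {
        StringBuilder html = new StringBuilder();
        String s;
        int curLineNum = 0;

        while ((s = br.readLine()) != null)
        {
            curLineNum++;

            if (curLineNum < startLineNum) continue;                    // before the range
            if ((endLineNum > 0) && (curLineNum > endLineNum)) break;   // past the range

            html.append(s).append('\n');
        }

        return html.toString();
    }
}
```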