Class GoogleQuery


  • public class GoogleQuery
    extends java.lang.Object
    GoogleQuery - Example class that makes an HTTP connection (rather than a REST invocation) to the Search-Engine.

    This class will return Google Search Results from a query to the Google Search Bar.

    IMPORTANT NOTE: As of September 2020 (when this class was written), Google Search Results are identified by the following:

    HTML Elements:
    <DIV CLASS="rc">            <!--    This divider contains / wraps the ENTIRE RESULT
                                        On most occasions, a single page returned from the
                                        search bar will contain PRECISELY 10 of these divider
                                        Elements.
                                -->
    
    <A HREF="..."><H3>...</H3>  <!--    This (first) Anchor 'A' Element and 'H3' Element contain the 
                                        actual link, and the link-text.  If this were to change with
                                        Google, this particular search-engine class would fail
                                -->
    </A>
    
    <DIV CLASS="s">             <!--    If there are "sub-links" available, they are "wrapped"
                                        inside of a divider ('DIV') whose CLASS="s".
                                        There is possibly other info buried here.
                                -->
    </DIV>
    </DIV>
    


    THIS CLASS WILL NOT WORK WITHOUT STARTING THE SPLASH SERVER. Package Torello.HTML contains a simple, mostly documentation / informational class named Torello.HTML.SplashBridge. The Splash Server was not developed by the same organization that wrote this Java HTML JAR library. Though the Splash Server has a myriad of features, it is used here primarily to execute the script found on a web-page. When querying the Search Bar, the pages that are returned are heavily laden with script and AJAX calls.

    In order to retrieve the HTML of the search being performed, the script calls that use AJAX, Java-Script, Type-Script, jQuery, React JS and Angular JS need to be completed. Currently, the most popular tool available for this task is the Selenium WebDriver package. It "hooks up directly" to an instance of Google Chrome and asks it to execute any script on a given web-page, before returning the HTML to be parsed. This can be useful, but since the tool is primarily marketed as a UI (User Interface) Testing Tool, much of the API is about performing button clicks and scroll-bar movements. Here, all that is needed is to make sure that the HTML which is intended to be viewed when a page finishes loading has, indeed, loaded.

    So far, this package has successfully used the Splash Tool to run the initial scripts that may be present on most web-pages.

    To start Splash, you must have access to the Docker Tool to install it on your machine. Splash is just a small, web-server-like piece of software. It listens on port 8050 and acts as a "Proxy" for calls to the Search Engine. When it receives a request for a URL, it first executes all available script on the page, and then returns the HTML to be parsed by this Java HTML Library.

    UNIX or DOS Shell Command:
    Install Docker, and make sure Docker version 17 (or greater) is installed. Note that sudo is a UNIX command-line prefix ("Super User Do"); it is not used in MS-DOS.

     # Pull the image:
     $ sudo docker pull scrapinghub/splash

     # Start the container:
     $ sudo docker run -it -p 8050:8050 --rm scrapinghub/splash

    Splash must be running and listening on port 8050 before the methods in this class can function.

    Microsoft Windows Users: Please review the class SplashBridge for information on using the Splash HTTP Server in a Windows Environment via the Docker Loading Program which has been ported to Microsoft Windows.



    Stateless Class:
    This class neither contains any program-state, nor can it be instantiated. The @StaticFunctional Annotation may also be called 'The Spaghetti Report'. Static-Functional classes are, essentially, C-Styled Files, without any constructors or non-static member fields. It is a concept very similar to the Java-Beans @Stateless Annotation.
    • 1 Constructor(s), 1 declared private, zero-argument constructor
    • 3 Method(s), 3 declared static
    • 2 Field(s), 2 declared static, 1 declared final
    • Fields excused from final modifier (with explanation):
      Field 'SPLASH_URL' is not final. Reason: CONFIGURATION


    • Field Detail

      • SPLASH_URL

        public static java.lang.String SPLASH_URL
        Using "Splash" is very simple: start the Server, and append this String to the beginning of all URL's:
        Code:
        Exact Field Declaration Expression:
         public static String SPLASH_URL = "http://localhost:8050/render.html?url=";
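        As a hedged illustration (not part of the library), this is how a caller might prepend SPLASH_URL to a target URL using only the JDK:

        ```java
        import java.net.URL;

        public class SplashUrlExample
        {
            // Mirrors the field documented above; assumes a local Splash instance on its default port
            static final String SPLASH_URL = "http://localhost:8050/render.html?url=";

            public static void main(String[] args) throws Exception
            {
                URL target  = new URL("https://google.com/search?q=java");
                URL proxied = new URL(SPLASH_URL + target.toString());

                // Prints: http://localhost:8050/render.html?url=https://google.com/search?q=java
                System.out.println(proxied);
            }
        }
        ```

        Strictly speaking, the target URL's query portion ought to be percent-encoded (for instance via java.net.URLEncoder) before being appended; the sketch above concatenates directly, exactly as this class does.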
        
    • Method Detail

      • main

        public static void main​(java.lang.String[] argv)
                         throws java.io.IOException
        This class may be invoked at the Command Line. The arguments passed to this class will be sent to the query(Appendable, String[]) method. The results will be printed to the terminal.
        Throws:
        java.io.IOException - If any I/O problems occur while scraping the site.
        Code:
        Exact Method Body:
         query(System.out, argv);
        
      • query

        public static Ret2<GoogleQuery.SearchResult[],​java.net.URL[]> query​
                    (java.lang.Appendable log,
                     java.lang.String... argv)
                throws java.io.IOException
        
        This will poll the nearest Google Web Server for the results of a search.

        IMPORTANT: As explained at the top of this class, this method will not work if the Splash Server has not been installed and started on your computer.
        Parameters:
        log - This is the log parameter. If this parameter is null, it shall be ignored. This parameter expects an implementation of Java's interface java.lang.Appendable which allows for a wide range of options when logging intermediate messages.
        Class or Interface Instance             Use & Purpose
        'System.out'                            Sends text to the standard-out terminal
        Torello.Java.StorageWriter              Sends text to System.out, and saves it internally
        FileWriter, PrintWriter, StringWriter   General-purpose Java text-output classes
        FileOutputStream, PrintStream           More general-purpose Java text-output classes

        IMPORTANT: The interface Appendable requires that the checked exception IOException be caught when using its append(CharSequence) methods.
        argv - This should be the list of keywords that would be typed into a Google Search Bar. There may be spaces here, so the entire search String may be sent as a single String, or it may be broken up into tokens and passed individually; either works.
        Returns:
        This shall return an instance of Ret2. The two elements that will make up the result will be as follows:

        1. Ret2.a (SearchResult[])

          This will be an array of SearchResult that (likely) contains several elements.

        2. Ret2.b (URL[])

          Search Engines return lists of next-pages in order to retrieve the next bunch of search results for a query. These are the links that this page has provided for those next-pages of search-results.
        Throws:
        java.io.IOException - If any I/O problems occur while scraping the site.
        Code:
        Exact Method Body:
         StringBuilder queryBuilder = new StringBuilder();
        
         for (int i=0; i < argv.length; i++)
         {
             String temp = argv[i].replace("+", "%2B").replace(" ", "+");
        
             temp = StrReplace.r
                 (temp, URL_ESC_CHARS, (int t, char c) -> '%' + Integer.toHexString((int) c));
        
             queryBuilder.append(temp);
             if (i < (argv.length -1)) queryBuilder.append('+');
         }
        
         String queryStr = queryBuilder.toString();
        
         if (log != null) log.append("Query String:\n" + BYELLOW + queryStr + RESET + '\n');
        
         return query(log, new URL("https://google.com/search?q=" + queryStr));
        
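        The keyword-encoding loop above can be sketched using only the JDK. This is a simplified, hypothetical re-creation; it omits the extra URL_ESC_CHARS escaping performed by StrReplace.r:

        ```java
        import java.util.StringJoiner;

        public class QueryStringSketch
        {
            // Percent-encode literal '+' signs, convert spaces to '+', and join tokens with '+'
            static String buildQuery(String... argv)
            {
                StringJoiner joiner = new StringJoiner("+");

                for (String arg : argv)
                    joiner.add(arg.replace("+", "%2B").replace(" ", "+"));

                return joiner.toString();
            }

            public static void main(String[] args)
            {
                // Prints: java+html+c%2B%2B+parser
                System.out.println(buildQuery("java html", "c++ parser"));
            }
        }
        ```

        Note the ordering of the two replace calls: literal '+' characters must be escaped to %2B before spaces are converted to '+', or the two would become indistinguishable.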
      • query

        public static Ret2<GoogleQuery.SearchResult[],​java.net.URL[]> query​
                    (java.lang.Appendable log,
                     java.net.URL query)
                throws java.io.IOException
        
        This will poll the nearest Google Web Server for the results of a search - given a provided URL. The URL provided to this method ought to be one of the URL's retrieved from the "next" button - as was returned by a previous search engine query (the Ret2.b list of URL's).

        IMPORTANT: As explained at the top of this class, this method will not work if the Splash Server has not been installed and started on your computer.

        Here is the "core HTML retrieve operation" for a Google Search Bar Result
         // Create a "Google Results Iterator" - each result is wrapped in an HTML Element
         // that looks like: <DIV CLASS="rc"> ... </DIV>
         HNLIInclusive resultsIter = InnerTagInclusiveIterator.get
              (v, "div", "class", TextComparitor.C, "rc");
         
         while (resultsIter.hasNext())
         {
              // Get the <DIV CLASS="rc"> ... </DIV> contents.
              Vector<HTMLNode>    result          = resultsIter.next();
         
              // The first anchor <A HREF=...> will contain the link for this search-result.
              Vector<HTMLNode>    firstLink       = TagNodeGetInclusive.first(result, "a");
         
              // The first <H3>...</H3> will contain the link-text for this URL search-result.
              Vector<HTMLNode>    mainLinkText    = TagNodeGetInclusive.first(firstLink, "h3");
         
              // This is how the URL and Anchor text is retrieved
              String url      = ((TagNode) firstLink.elementAt(0)).AV("href").trim();
              String title    = Util.textNodesString(mainLinkText).trim();
         }
        
        Parameters:
        log - This is the log parameter. If this parameter is null, it shall be ignored. This parameter expects an implementation of Java's interface java.lang.Appendable which allows for a wide range of options when logging intermediate messages.
        Class or Interface Instance             Use & Purpose
        'System.out'                            Sends text to the standard-out terminal
        Torello.Java.StorageWriter              Sends text to System.out, and saves it internally
        FileWriter, PrintWriter, StringWriter   General-purpose Java text-output classes
        FileOutputStream, PrintStream           More general-purpose Java text-output classes

        IMPORTANT: The interface Appendable requires that the checked exception IOException be caught when using its append(CharSequence) methods.
        query - This may be a query URL that has been prepared, by Google, to be used for the "next 10 results" of a particular search.

        Specifically: This URL should have been retrieved from a previous search-results page, and was listed as containing additional (next 10 matches) links.
        Returns:
        This shall return an instance of Ret2. The two elements that will make up the result will be as follows:

        1. Ret2.a (SearchResult[])

          This will be an array of SearchResult that (likely) contains several elements.

        2. Ret2.b (URL[])

          Search Engines return lists of next-pages in order to retrieve the next bunch of search results for a query. These are the links that this page has provided for those next-pages of search-results.
        Throws:
        java.io.IOException - If any I/O problems occur while scraping the site.
        Code:
        Exact Method Body:
         // Use a java Stream.Builder to save the results to a Java Stream.
         // Streams are easily converted to arrays.
         Stream.Builder<SearchResult> resultsBuilder = Stream.builder();
        
         URL splashQuery = new URL(SPLASH_URL + query.toString());
        
         // Download the HTML, and save it to a java.util.Vector (like an array)
         Vector<HTMLNode> v = HTMLPage.getPageTokens(splashQuery, false);
        
         // Create a "Google Results Iterator" - each result is wrapped in an HTML Element
         // that looks like: <DIV CLASS="rc"> ... </DIV>
         HNLIInclusive resultsIter = InnerTagInclusiveIterator.get
             (v, "div", "class", TextComparitor.C, "rc");
        
         while (resultsIter.hasNext())
         {
             // Get the <DIV CLASS="rc"> ... </DIV> contents.
             Vector<HTMLNode>    result          = resultsIter.next();
        
             // The first anchor <A HREF=...> will contain the link for this search-result.
             Vector<HTMLNode>    firstLink       = TagNodeGetInclusive.first(result, "a");
        
             // The first <H3>...</H3> will contain the link-text for this URL search-result.
             Vector<HTMLNode>    mainLinkText    = TagNodeGetInclusive.first(firstLink, "h3");
        
             // If there are additional (sub-links) they will be 'wrapped' in a 
             // <DIV CLASS="s"> ... </DIV> HTML Divider Element.
             DotPair             subLinkDIV      = InnerTagFindInclusive.first
                                                     (result, "div", "class", TextComparitor.C, "s");
        
             String url      = ((TagNode) firstLink.elementAt(0)).AV("href").trim();
             String title    = Util.textNodesString(mainLinkText).trim();
        
             // Save the results in a Java Stream, using Stream.Builder.
             Stream.Builder<SearchResult> subResultsBuilder = Stream.builder();
        
             if (subLinkDIV != null)
             {
                 // To get the list of search-result sub-links, retrieve all links that are labelled
                 // <A CLASS="fl"> ... </A>
                 HNLIInclusive subLinksIter = InnerTagInclusiveIterator.get
                     (result, "a", "class", TextComparitor.C, "fl");
        
                 subLinksIter.restrictCursor(subLinkDIV);
        
                 // Iterate through any "Sub Links"  Again, a "Sub Link" is hereby being defined
                 // as a search result for a particular web-site that would be able to produce
                 // many / numerous additional links.  Often times these additional links are more
                 // useful than the primary link that was returned.
                 while (subLinksIter.hasNext())
                 {
                     Vector<HTMLNode>    subLink         = subLinksIter.next();
        
                     // Get the URL
                     String subLinkURL = ((TagNode) subLink.elementAt(0)).AV("href").trim();
        
                     // Get the text that is wrapped inside the <A HREF=..> "this-text" </A>
                     // HTML Element.  Util.textNodesString(...) simply removes all TagNodes, and
                     // appends the TextNodes together.
                     String subLinkTitle = Util.textNodesString(subLink).trim();
        
                     subResultsBuilder.accept(new SearchResult(subLinkURL, subLinkTitle));
                 }
             }
        
             // Use Java Stream's to build the SearchResult[] Array.  Call the
             // Stream.Builder.build() method, and then call the Stream.toArray(...) method.
             SearchResult[] subResults = subResultsBuilder.build().toArray(SearchResult[]::new);
        
             SearchResult sr = new SearchResult
                 (url, title, (subResults.length > 0) ? subResults : null);
        
             resultsBuilder.accept(sr);
         }
        
         // Use Java's Stream.Builder.build() to create the Stream, then easily convert
         // to an array.
         SearchResult[] srArr = resultsBuilder.build().toArray(SearchResult[]::new);
        
         // If the log is not null, print out the results.
         if (log != null)
             for (SearchResult sr : srArr) log.append(sr.toString() + '\n');
        
         // IMPORTANT NOTE:  This code will retrieve the next available 10 PAGES of 
         // SEARCH RESULTS as a URL.
         AVT criteria1 = AVT.cmp("aria-label", Pattern.compile("Page \\d++"));
         AVT criteria2 = AVT.cmp("class", TextComparitor.C, "fl");
        
         // Use these criteria specifiers to find the HTML '<A HREF...>' NEXT-PAGE in
         // SEARCH-RESULTS links...
         Vector<TagNode> nextPages = InnerTagGet.all(v, criteria1.and(criteria2), "a");
        
         URL[] urlArr = new URL[nextPages.size()];
        
         // Print out the URL for each of the next pages.  A programmer may expand this
         // answer by investigating more of these links in a loop.
 for (int i=0; i < nextPages.size(); i++)
 {
     TagNode link = nextPages.elementAt(i);
     urlArr[i] = Links.resolveHREF(link, query);
 }
        
         return new Ret2<SearchResult[], URL[]>(srArr, urlArr);
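
        The Links.resolveHREF(...) call near the end resolves each next-page HREF against the original query URL. A minimal sketch of that kind of resolution, using only java.net.URL's two-argument constructor (the HREF value shown is a hypothetical example, not taken from a real results page):

        ```java
        import java.net.URL;

        public class NextPageResolveSketch
        {
            public static void main(String[] args) throws Exception
            {
                URL base = new URL("https://google.com/search?q=java");

                // A relative next-page HREF, as it might appear inside an <A> element
                URL next = new URL(base, "/search?q=java&start=10");

                // Prints: https://google.com/search?q=java&start=10
                System.out.println(next);
            }
        }
        ```

        Resolving against the original query URL (rather than hard-coding a host) keeps the next-page links correct even when the search engine returns relative HREF values.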