Package Torello.HTML.Tools.SearchEngines
Class GoogleQuery
java.lang.Object
    Torello.HTML.Tools.SearchEngines.GoogleQuery
public class GoogleQuery extends java.lang.Object
GoogleQuery - Example class that makes an HTTP connection (rather than a REST invocation) to the Search-Engine. This class will return Google Search Results from a query to the Google Search Bar.
IMPORTANT NOTE: As of September 2020 (the writing of this class), Google Search Results are identified by the following HTML Elements:

<DIV CLASS="rc">
    <!-- This divider contains / wraps the ENTIRE RESULT.  On most occasions, a
         single page returned from asking the search bar will contain PRECISELY 10
         of these divider Elements. -->
    <A HREF="...">
        <H3>...</H3>
        <!-- This (first) Anchor 'A' Element and 'H3' Element contain the actual
             link, and the link-text.  If this were to change with Google, this
             particular search-engine class would fail. -->
    </A>
    <DIV CLASS="s">
        <!-- If there are "sub-links" available, they are "wrapped" inside of a
             divider ('DIV') whose CLASS="s".  There is possibly other info buried
             here. -->
    </DIV>
</DIV>
THIS CLASS WILL NOT WORK WITHOUT STARTING THE SPLASH-SERVER. Package Torello.HTML contains a simple - mostly documentation / informational - class named Torello.HTML.SplashBridge. This server was not developed by the same organization as this JAR (Java HTML library) was. Though the Splash Server has a myriad of features, primarily it is used here to execute the script found on a web-page. When querying the Search Bar, the pages that are returned are heavily laden with script and AJAX calls.

In order to retrieve the HTML of the search being performed, the script calls that use AJAX, Java-Script, Type-Script, jQuery, React JS and Angular JS need to be completed. Currently, the most popular tool available for this task is the Selenium WebDriver package. It "hooks up directly" to an instance of Google Chrome and asks it to execute any script on a given web-page before returning the HTML to be parsed. This can be useful, but since the tool is primarily marketed as a UI (User Interface) Testing Tool, much of its API is about performing button clicks and scroll-bar movements. Here, all that is needed is to make sure that the HTML which is intended to be viewed when a page finishes loading is, indeed, loaded.

So far, this package has successfully used the Splash Tool to run the initial scripts that may be present on most web-pages.

To start Splash, you must have access to the Docker Tool to install it on your machine. Splash is just a small, web-server-like piece of software. It will listen on port 8050, and act as a "Proxy" for calls to the Search Engine. When it receives a request for a URL, for instance, it will execute all available script on the page first, and then return the HTML to be parsed by this Java HTML Library.
UNIX or DOS Shell Command:
# Install Docker.  Make sure Docker version 17 (or greater) is installed.
# 'sudo' is for the UNIX command line (Super User Do), not for MS-DOS.

# Pull the image:
$ sudo docker pull scrapinghub/splash

# Start the container:
$ sudo docker run -it -p 8050:8050 --rm scrapinghub/splash
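Once the container is up, it can help to confirm that Splash is actually reachable before invoking the methods in this class. Below is a minimal, hypothetical sketch (the class name SplashCheck is illustrative, not part of this library) that asks the same render.html end-point used by SPLASH_URL to render a trivial page:

import java.net.HttpURLConnection;
import java.net.URL;

public class SplashCheck
{
    public static void main(String[] argv) throws Exception
    {
        // Ask Splash to render a trivial page.  An HTTP 200 response means the
        // server is up and listening on port 8050.
        URL url = new URL
            ("http://localhost:8050/render.html?url=https://www.example.com");

        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("GET");

        System.out.println("Splash responded: HTTP " + con.getResponseCode());
        con.disconnect();
    }
}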
The Splash container must be running and listening on port 8050 before the methods in this class can function. Microsoft Windows Users: please review the class SplashBridge for information on using the Splash HTTP Server in a Windows Environment via the Docker Loading Program, which has been ported to Microsoft Windows.
Hi-Lited Source-Code:
- View Here: Torello/HTML/Tools/SearchEngines/GoogleQuery.java
Stateless Class: This class neither contains any program-state, nor can it be instantiated. The @StaticFunctional Annotation may also be called 'The Spaghetti Report'. Static-Functional classes are, essentially, C-Styled Files, without any constructors or non-static member fields. It is a concept very similar to the Java-Bean's @Stateless Annotation.
- 1 Constructor(s), 1 declared private, zero-argument constructor
- 3 Method(s), 3 declared static
- 2 Field(s), 2 declared static, 1 declared final
- Fields excused from final modifier (with explanation):
Field 'SPLASH_URL' is not final. Reason: CONFIGURATION
Nested Class Summary
Modifier and Type    Class
static class         GoogleQuery.SearchResult
Field Summary
Modifier and Type    Field
static String        SPLASH_URL
Method Summary
Modifier and Type                                 Method
static void                                       main(String[] argv)
static Ret2<GoogleQuery.SearchResult[], URL[]>    query(Appendable log, String... argv)
static Ret2<GoogleQuery.SearchResult[], URL[]>    query(Appendable log, URL query)
Field Detail
-
SPLASH_URL
public static java.lang.String SPLASH_URL
In order to use "Splash" - it is *very simple*: start the Server, and append this String to the beginning of all URL's.
- Code:
- Exact Field Declaration Expression:
public static String SPLASH_URL = "http://localhost:8050/render.html?url=";
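For instance, wrapping a query-URL with this prefix might look as follows. This is only a sketch; it mirrors what the query(...) methods below do internally before downloading any HTML:

import java.net.URL;

// Prepend SPLASH_URL so that Splash executes the page's scripts before this
// library downloads and parses the resulting HTML.
URL target      = new URL("https://google.com/search?q=java+html");
URL splashQuery = new URL(GoogleQuery.SPLASH_URL + target.toString());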
Method Detail
-
main
public static void main(java.lang.String[] argv) throws java.io.IOException
This class may be invoked at the Command Line. The arguments passed to this class will be sent to the query(Appendable, String[]) method. The results will be printed to the terminal.
- Throws:
java.io.IOException - If any I/O problems occur while scraping the site.
- Code:
- Exact Method Body:
query(System.out, argv);
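As a hypothetical example, this entry-point could also be invoked programmatically; the keywords below are illustrative:

// The keywords are forwarded to query(System.out, argv), and the
// search-results print to the terminal.
GoogleQuery.main(new String[] { "java", "html", "parser" });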
-
query
public static Ret2<GoogleQuery.SearchResult[],java.net.URL[]> query (java.lang.Appendable log, java.lang.String... argv) throws java.io.IOException
This will poll the nearest Google Web Server for the results of a search.
IMPORTANT: As explained at the top of this class, this method will not work if the Splash Server has not been installed and started on your computer.
- Parameters:
log - This is the log parameter. If this parameter is null, it shall be ignored. This parameter expects an implementation of Java's interface java.lang.Appendable, which allows for a wide range of options when logging intermediate messages.

Class or Interface Instance              Use & Purpose
'System.out'                             Sends text to the standard-out terminal
Torello.Java.StorageWriter               Sends text to System.out, and saves it, internally
FileWriter, PrintWriter, StringWriter    General-purpose java text-output classes
FileOutputStream, PrintStream            More general-purpose java text-output classes

IMPORTANT: The interface Appendable requires that the checked exception IOException must be caught when using its append(CharSequence) methods.
argv - This should be the list of keywords that would be typed into a Google Search Bar. There may be spaces here, so the entire search String may be sent as a single String, or it may be broken up into tokens and passed individually here. It is mostly irrelevant.
- Returns:
- This shall return an instance of Ret2. The two elements that make up the result will be as follows:
Ret2.a (SearchResult[]) - This will be an array of SearchResult that (likely) contains several elements.
Ret2.b (URL[]) - Search Engines return lists of next-pages in order to retrieve the next bunch of search results for a query. These are the links that this page has provided for those next-pages of search-results.
- Throws:
java.io.IOException - If any I/O problems occur while scraping the site.
- Code:
- Exact Method Body:
StringBuilder queryBuilder = new StringBuilder();

for (int i=0; i < argv.length; i++)
{
    String temp = argv[i].replace("+", "%2B").replace(" ", "+");

    temp = StrReplace.r
        (temp, URL_ESC_CHARS, (int t, char c) -> '%' + Integer.toHexString((int) c));

    queryBuilder.append(temp);

    if (i < (argv.length - 1)) queryBuilder.append('+');
}

String queryStr = queryBuilder.toString();

if (log != null) log.append("Query String:\n" + BYELLOW + queryStr + RESET + '\n');

return query(log, new URL("https://google.com/search?q=" + queryStr));
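A minimal usage sketch for this method follows. The keywords are illustrative, and it is assumed here that Ret2 resides in package Torello.Java and that the Splash Server is already running:

import Torello.Java.Ret2;
import java.net.URL;

// Unpack the two halves of the returned Ret2 "tuple."
// (Assumes the enclosing method declares 'throws java.io.IOException'.)
Ret2<GoogleQuery.SearchResult[], URL[]> ret =
    GoogleQuery.query(System.out, "java", "html", "scraper");

GoogleQuery.SearchResult[] results   = ret.a;  // the search-results themselves
URL[]                      nextPages = ret.b;  // links to next pages of results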
-
query
public static Ret2<GoogleQuery.SearchResult[],java.net.URL[]> query (java.lang.Appendable log, java.net.URL query) throws java.io.IOException
This will poll the nearest Google Web Server for the results of a search - given a provided URL. The URL provided to this method ought to be one of the URL's retrieved from the "next" button - as was returned by a previous search engine query (the Ret2.b list of URL's).

IMPORTANT: As explained at the top of this class, this method will not work if the Splash Server has not been installed and started on your computer.

Here is the "core HTML retrieve operation" for a Google Search Bar Result:

// Create a "Google Results Iterator" - each result is wrapped in an HTML Element
// that looks like: <DIV CLASS="rc"> ... </DIV>
HNLIInclusive resultsIter = InnerTagInclusiveIterator.get
    (v, "div", "class", TextComparitor.C, "rc");

while (resultsIter.hasNext())
{
    // Get the <DIV CLASS="rc"> ... </DIV> contents.
    Vector<HTMLNode> result = resultsIter.next();

    // The first anchor <A HREF=...> will contain the link for this search-result.
    Vector<HTMLNode> firstLink = TagNodeGetInclusive.first(result, "a");

    // The first <H3>...</H3> will contain the link-text for this URL search-result.
    Vector<HTMLNode> mainLinkText = TagNodeGetInclusive.first(firstLink, "h3");

    // This is how the URL and Anchor text are retrieved
    String url   = ((TagNode) firstLink.elementAt(0)).AV("href").trim();
    String title = Util.textNodesString(mainLinkText).trim();
}
- Parameters:
log - This is the log parameter. If this parameter is null, it shall be ignored. This parameter expects an implementation of Java's interface java.lang.Appendable, which allows for a wide range of options when logging intermediate messages.

Class or Interface Instance              Use & Purpose
'System.out'                             Sends text to the standard-out terminal
Torello.Java.StorageWriter               Sends text to System.out, and saves it, internally
FileWriter, PrintWriter, StringWriter    General-purpose java text-output classes
FileOutputStream, PrintStream            More general-purpose java text-output classes

IMPORTANT: The interface Appendable requires that the checked exception IOException must be caught when using its append(CharSequence) methods.
query - This may be a query URL that has been prepared, by Google, to be used for the "next 10 results" of a particular search. Specifically: this URL should have been retrieved from a previous search-results page, and was listed as containing additional (next 10 matches) links.
- Returns:
- This shall return an instance of Ret2. The two elements that make up the result will be as follows:
Ret2.a (SearchResult[]) - This will be an array of SearchResult that (likely) contains several elements.
Ret2.b (URL[]) - Search Engines return lists of next-pages in order to retrieve the next bunch of search results for a query. These are the links that this page has provided for those next-pages of search-results.
- Throws:
java.io.IOException - If any I/O problems occur while scraping the site.
- Code:
- Exact Method Body:
// Use a java Stream.Builder to save the results to a Java Stream.
// Streams are easily converted to arrays.
Stream.Builder<SearchResult> resultsBuilder = Stream.builder();

URL splashQuery = new URL(SPLASH_URL + query.toString());

// Download the HTML, and save it to a java.util.Vector (like an array)
Vector<HTMLNode> v = HTMLPage.getPageTokens(splashQuery, false);

// Create a "Google Results Iterator" - each result is wrapped in an HTML Element
// that looks like: <DIV CLASS="rc"> ... </DIV>
HNLIInclusive resultsIter = InnerTagInclusiveIterator.get
    (v, "div", "class", TextComparitor.C, "rc");

while (resultsIter.hasNext())
{
    // Get the <DIV CLASS="rc"> ... </DIV> contents.
    Vector<HTMLNode> result = resultsIter.next();

    // The first anchor <A HREF=...> will contain the link for this search-result.
    Vector<HTMLNode> firstLink = TagNodeGetInclusive.first(result, "a");

    // The first <H3>...</H3> will contain the link-text for this URL search-result.
    Vector<HTMLNode> mainLinkText = TagNodeGetInclusive.first(firstLink, "h3");

    // If there are additional (sub-links) they will be 'wrapped' in a
    // <DIV CLASS="s"> ... </DIV> HTML Divider Element.
    DotPair subLinkDIV = InnerTagFindInclusive.first
        (result, "div", "class", TextComparitor.C, "s");

    String url   = ((TagNode) firstLink.elementAt(0)).AV("href").trim();
    String title = Util.textNodesString(mainLinkText).trim();

    // Save the results in a Java Stream, using Stream.Builder.
    Stream.Builder<SearchResult> subResultsBuilder = Stream.builder();

    if (subLinkDIV != null)
    {
        // To get the list of search-result sub-links, retrieve all links that are
        // labelled <A CLASS="fl"> ... </A>
        HNLIInclusive subLinksIter = InnerTagInclusiveIterator.get
            (result, "a", "class", TextComparitor.C, "fl");

        subLinksIter.restrictCursor(subLinkDIV);

        // Iterate through any "Sub Links."  Again, a "Sub Link" is hereby being
        // defined as a search result for a particular web-site that would be able
        // to produce many / numerous additional links.  Often times these
        // additional links are more useful than the primary link that was returned.
        while (subLinksIter.hasNext())
        {
            Vector<HTMLNode> subLink = subLinksIter.next();

            // Get the URL
            String subLinkURL = ((TagNode) subLink.elementAt(0)).AV("href").trim();

            // Get the text that is wrapped inside the <A HREF=..> "this-text" </A>
            // HTML Element.  Util.textNodesString(...) simply removes all TagNodes,
            // and appends the TextNodes together.
            String subLinkTitle = Util.textNodesString(subLink).trim();

            subResultsBuilder.accept(new SearchResult(subLinkURL, subLinkTitle));
        }
    }

    // Use Java Stream's to build the SearchResult[] Array.  Call the
    // Stream.Builder.build() method, and then call the Stream.toArray(...) method.
    SearchResult[] subResults = subResultsBuilder.build().toArray(SearchResult[]::new);

    SearchResult sr = new SearchResult
        (url, title, (subResults.length > 0) ? subResults : null);

    resultsBuilder.accept(sr);
}

// Use Java's Stream.Builder.build() to create the Stream, then easily convert
// to an array.
SearchResult[] srArr = resultsBuilder.build().toArray(SearchResult[]::new);

// If the log is not null, print out the results.
if (log != null)
    for (SearchResult sr : srArr)
        log.append(sr.toString() + '\n');

// IMPORTANT NOTE: This code will retrieve the next available 10 PAGES of
// SEARCH RESULTS as a URL.
AVT criteria1 = AVT.cmp("aria-label", Pattern.compile("Page \\d++"));
AVT criteria2 = AVT.cmp("class", TextComparitor.C, "fl");

// Use these criteria specifiers to find the HTML '<A HREF...>' NEXT-PAGE in
// SEARCH-RESULTS links...
Vector<TagNode> nextPages = InnerTagGet.all(v, criteria1.and(criteria2), "a");

URL[] urlArr = new URL[nextPages.size()];

// Resolve the HREF of each next-page link against the original query URL.
// A programmer may expand this answer by investigating more of these links
// in a loop.
for (int i=0; i < nextPages.size(); i++)
{
    TagNode link = nextPages.elementAt(i);
    urlArr[i] = Links.resolveHREF(link, query);
}

return new Ret2<SearchResult[], URL[]>(srArr, urlArr);
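A hypothetical follow-up sketch: the Ret2.b array returned by a keyword search may be fed back into this method to walk through subsequent pages of results (identifiers below are illustrative, and it is assumed that Ret2 resides in package Torello.Java):

import Torello.Java.Ret2;
import java.net.URL;

// Run the initial keyword search, then follow the first "next page" link.
// (Assumes the enclosing method declares 'throws java.io.IOException'.)
Ret2<GoogleQuery.SearchResult[], URL[]> page1 =
    GoogleQuery.query(System.out, "java", "html", "parser");

if (page1.b.length > 0)
{
    Ret2<GoogleQuery.SearchResult[], URL[]> page2 =
        GoogleQuery.query(System.out, page1.b[0]);

    System.out.println("Page 2 contained " + page2.a.length + " results.");
}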