Package Torello.HTML.Tools.SearchEngines
Class BaiDuQuery
- java.lang.Object
  - Torello.HTML.Tools.SearchEngines.BaiDuQuery
public class BaiDuQuery extends java.lang.Object
BaiDuQuery (百度搜索) - Example class that makes an HTTP connection (rather than a REST invocation) to the Search-Engine.
Searches the Chinese Search Engine "百度搜索" (www.BaiDu.com).
The query methods in this class will accept Mandarin Chinese search Strings.
IMPORTANT NOTE: As of September 2020 - the writing of this class - 百度 (BaiDu) Search Results are identified by the following HTML Elements:

<DIV CLASS="result ...">      <!-- **OR** -->
<DIV CLASS="result-op ...">

    <!-- This divider contains / wraps the result.  There will likely be 12 such
         HTML '<DIV CLASS="result">...</DIV>' elements on a single page of a
         BaiDu Search. -->

    <!-- When the class is "result-op", instead of "result", it means there are
         likely to be "sub-matches" listed in addition to the primary
         search-result.  A "sub-match" is where the result-site has multiple
         search results listed on its site - rather than just one. -->

    <A HREF="...">
        <!-- This Anchor 'A' Element and 'H3' Element contain the actual link,
             and the link-text.  If this were to change at BaiDu, this
             particular search-engine class would fail. -->
    </A>

    <DIV CLASS="c-row">
        <!-- If there are "sub-links" available, they are "wrapped" inside of a
             divider ('DIV') whose CLASS="c-row".  There is possibly other info
             buried here. -->
    </DIV>
</DIV>
THIS CLASS WILL NOT WORK WITHOUT STARTING THE SPLASH-SERVER. Package Torello.HTML contains a simple - mostly documentation / informational - class named Torello.HTML.SplashBridge. This server was not developed by the same organization as this Java HTML library. Though the Splash Server has a myriad of features, primarily it is used here to execute the script found on a web-page. When querying the Search Bar, the pages that are returned are heavily laden with script and AJAX calls.

In order to retrieve the HTML of the search being performed, the script calls that use AJAX, Java-Script, Type-Script, jQuery, React JS and Angular JS need to be completed. Currently, the most popular tool available for this task is the Selenium WebDriver package. It "hooks up directly" to an instance of Google Chrome and asks it to execute any script on a given web-page, before returning the HTML to be parsed. This can be useful, but since the tool is primarily marketed as a UI (User Interface) Testing Tool, much of the API is about performing button clicks and scroll-bar movements. Here, all that is needed is to make sure that the HTML which is intended to be viewed when a page finishes loading is indeed loaded.

So far, this package has successfully used the Splash Tool to run the initial scripts that may be present on most web-pages.

To start Splash, you must have access to the Docker Tool to install it on your machine. Splash is just a small, web-server-like piece of software. It will listen on port 8050, and act as a "Proxy" for calls to the Search Engine. When it receives a request for a URL, for instance, it will execute all available script on the page first, and then return the HTML to be parsed by this Java HTML Library.
UNIX or DOS Shell Command:
Install Docker. Make sure Docker version 17 (or greater) is installed. Note that sudo is for the UNIX command line (Super User Do), not for MS-DOS.

Pull the image:
$ sudo docker pull scrapinghub/splash

Start the container:
$ sudo docker run -it -p 8050:8050 --rm scrapinghub/splash
This must be run and listening on port 8050 before the methods in this class can function. Microsoft Windows Users: Please review the class SplashBridge for information on using the Splash HTTP Server in a Windows Environment, via the Docker Loading Program which has been ported to Microsoft Windows.
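Once the container is running, a quick connectivity check confirms that Splash is reachable before any of the query methods are invoked. Below is a minimal, hypothetical sketch (the class-name and the target URL are arbitrary examples); it simply opens an HTTP connection to the same render.html endpoint this class uses:

import java.net.HttpURLConnection;
import java.net.URL;

public class SplashCheck
{
    public static void main(String[] argv) throws Exception
    {
        // The same endpoint-prefix that is stored in BaiDuQuery.SPLASH_URL
        URL url = new URL("http://localhost:8050/render.html?url=https://www.baidu.com/");

        HttpURLConnection con = (HttpURLConnection) url.openConnection();

        // HTTP 200 indicates the Splash Server is up, and rendering pages
        System.out.println("Splash responded: HTTP " + con.getResponseCode());
    }
}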
Hi-Lited Source-Code:
- View Here: Torello/HTML/Tools/SearchEngines/BaiDuQuery.java
- Open New Browser-Tab: Torello/HTML/Tools/SearchEngines/BaiDuQuery.java
Stateless Class: This class neither contains any program-state, nor can it be instantiated. The @StaticFunctional Annotation may also be called 'The Spaghetti Report'. Static-Functional classes are, essentially, C-Styled Files, without any constructors or non-static member fields. It is a concept very similar to the Java-Bean's @Stateless Annotation.
- 1 Constructor(s), 1 declared private, zero-argument constructor
- 3 Method(s), 3 declared static
- 2 Field(s), 2 declared static, 1 declared final
- Fields excused from final modifier (with explanation):
Field 'SPLASH_URL' is not final. Reason: CONFIGURATION
Nested Class Summary

  Modifier and Type    Class
  static class         BaiDuQuery.SearchResult

Field Summary

  Modifier and Type    Field
  static String        SPLASH_URL

Method Summary  (All Methods are Static, Concrete Methods)

  Modifier and Type                               Method
  static void                                     main(String[] argv)
  static Ret2<BaiDuQuery.SearchResult[], URL[]>   query(Appendable log, String... argv)
  static Ret2<BaiDuQuery.SearchResult[], URL[]>   query(Appendable log, URL query)
Field Detail
SPLASH_URL
public static java.lang.String SPLASH_URL
In order to use "Splash" - it is *very simple*: Start the Server, and append this String to the beginning of all URLs.

- Code:
- Exact Field Declaration Expression:

public static String SPLASH_URL = "http://localhost:8050/render.html?url=";
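As an illustrative sketch (the class-name and the search URL below are arbitrary examples), this is how the field is prepended - exactly as the query methods of this class do internally:

import java.net.URL;
import Torello.HTML.Tools.SearchEngines.BaiDuQuery;

public class SplashUrlDemo
{
    public static void main(String[] argv) throws Exception
    {
        // An arbitrary example of a 百度 search URL
        URL query = new URL("https://www.baidu.com/s?wd=Java");

        // Prepend the Splash proxy-prefix.  Splash will execute the page's
        // scripts, and then return the finished HTML.
        URL splashQuery = new URL(BaiDuQuery.SPLASH_URL + query.toString());

        // Prints: http://localhost:8050/render.html?url=https://www.baidu.com/s?wd=Java
        System.out.println(splashQuery);
    }
}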
Method Detail
main
public static void main(java.lang.String[] argv) throws java.io.IOException
This class may be invoked at the Command Line. The arguments passed to this class will be sent to the query(Appendable, String[]) method. The results will be printed to the terminal.

- Throws:
  java.io.IOException - If any I/O problems occur while scraping the site.

- Code:
- Exact Method Body:

query(System.out, argv);
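For illustration, the hypothetical snippet below is the programmatic equivalent of a command-line run (the class-name and search terms are arbitrary examples, and the Splash Server must already be listening on port 8050):

import Torello.HTML.Tools.SearchEngines.BaiDuQuery;

public class BaiDuQueryDemo
{
    public static void main(String[] argv) throws java.io.IOException
    {
        // Forwards the search terms to query(Appendable, String...);
        // all results are printed to the terminal via System.out.
        BaiDuQuery.main(new String[] { "北京", "大学" });
    }
}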
query
public static Ret2<BaiDuQuery.SearchResult[],java.net.URL[]> query (java.lang.Appendable log, java.lang.String... argv) throws java.io.IOException
This will poll the nearest 百度.com Web Server for the results of a search.

IMPORTANT: As explained at the top of this class, this method will not work if the Splash Server has not been installed and started on your computer.

- Parameters:
  log - This is the log parameter. If this parameter is null, it shall be ignored. This parameter expects an implementation of Java's interface java.lang.Appendable, which allows for a wide range of options when logging intermediate messages.

    Class or Interface Instance              Use & Purpose
    'System.out'                             Sends text to the standard-out terminal
    Torello.Java.StorageWriter               Sends text to System.out, and saves it, internally
    FileWriter, PrintWriter, StringWriter    General-purpose java text-output classes
    FileOutputStream, PrintStream            More general-purpose java text-output classes

    IMPORTANT: The interface Appendable requires that the checked exception IOException be caught when using its append(CharSequence) methods.

  argv - This should be the list of keywords that would be typed into a 百度 Search Bar. There may be spaces here, so the entire search String may be sent as a single String, or it may be broken up into tokens and passed individually here. It is mostly irrelevant.

- Returns:
  This shall return an instance of Ret2. The two elements that make up the result are as follows:

  - Ret2.a (SearchResult[])
    This will be an array of SearchResult that (likely) contains several elements.

  - Ret2.b (URL[])
    Search Engines return lists of next-pages in order to retrieve the next batch of search results for a query. These are the links that this page has provided for those next-pages of search-results.

- Throws:
  java.io.IOException - If any I/O problems occur while scraping the site.

- Code:
- Exact Method Body:
StringBuilder queryBuilder = new StringBuilder();

// Escape each keyword: '+' and spaces first, then any other URL-escape characters
for (int i=0; i < argv.length; i++)
{
    String temp = argv[i].replace("+", "%2B").replace(" ", "+");

    temp = StrReplace.r
        (temp, URL_ESC_CHARS, (int t, char c) -> '%' + Integer.toHexString((int) c));

    queryBuilder.append(temp);

    // Join the individual keywords with '+'
    if (i < (argv.length - 1)) queryBuilder.append('+');
}

String queryStr = queryBuilder.toString();

if (log != null) log.append
    ("Query String:\n" + BYELLOW + queryStr + RESET + '\n');

// Delegate to query(Appendable, URL), using the 百度 search URL-prefix
return query(log, new URL("https://www.baidu.com/s?wd=" + queryStr));
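As a usage sketch (the keywords are arbitrary examples; the Splash Server must be running, and the relevant Torello imports are assumed), the two halves of the returned Ret2 may be consumed like this:

// Run the keyword search; intermediate messages are logged to System.out
Ret2<BaiDuQuery.SearchResult[], URL[]> results =
    BaiDuQuery.query(System.out, "上海", "天气");

// Ret2.a: the search-results scraped from the first results-page
for (BaiDuQuery.SearchResult sr : results.a)
    System.out.println(sr);

// Ret2.b: the "next page" links, which may be passed to query(Appendable, URL)
System.out.println("Next-Page Links Available: " + results.b.length);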
query
public static Ret2<BaiDuQuery.SearchResult[],java.net.URL[]> query (java.lang.Appendable log, java.net.URL query) throws java.io.IOException
This will poll the nearest 百度 Web Server for the results of a search, given a provided URL. The URL provided to this method ought to be one of the URLs retrieved from the "next" button - as was returned by a previous search-engine query (the Ret2.b list of URLs).

IMPORTANT: As explained at the top of this class, this method will not work if the Splash Server has not been installed and started on your computer.
Here is the "core HTML retrieve operation" for a BaiDu.com Search-Bar Result:

// Create a "百度 Results Iterator" - each result is wrapped in an HTML Element
// that looks like: <DIV CLASS="result"> ... </DIV>  (or CLASS="result-op")
HNLIInclusive resultsIter = InnerTagInclusiveIterator.get
    (v, "div", "class", TextComparitor.C_OR, "result", "result-op");

while (resultsIter.hasNext())
{
    // Retrieve the next <DIV CLASS="result"> ... </DIV> contents
    Vector<HTMLNode> result = resultsIter.next();

    // The first anchor <A HREF=...> will contain the link for this search-result.
    Vector<HTMLNode> firstLink = TagNodeGetInclusive.first(result, "a");

    // Here is how the URL and Anchor text is collected
    String url   = ((TagNode) firstLink.elementAt(0)).AV("href").trim();
    String title = Util.textNodesString(firstLink).trim();
}
- Parameters:
  log - This is the log parameter. If this parameter is null, it shall be ignored. This parameter expects an implementation of Java's interface java.lang.Appendable, which allows for a wide range of options when logging intermediate messages.

    Class or Interface Instance              Use & Purpose
    'System.out'                             Sends text to the standard-out terminal
    Torello.Java.StorageWriter               Sends text to System.out, and saves it, internally
    FileWriter, PrintWriter, StringWriter    General-purpose java text-output classes
    FileOutputStream, PrintStream            More general-purpose java text-output classes

    IMPORTANT: The interface Appendable requires that the checked exception IOException be caught when using its append(CharSequence) methods.

  query - This may be a query URL that has been prepared, by 百度, to be used for the "next 10 results" of a particular search. Specifically: this URL should have been retrieved from a previous search-results page, and was listed as containing additional (next 10 matches) links.

- Returns:
  This shall return an instance of Ret2. The two elements that make up the result are as follows:

  - Ret2.a (SearchResult[])
    This will be an array of SearchResult that (likely) contains several elements.

  - Ret2.b (URL[])
    Search Engines return lists of next-pages in order to retrieve the next batch of search results for a query. These are the links that this page has provided for those next-pages of search-results.

- Throws:
  java.io.IOException - If any I/O problems occur while scraping the site.

- Code:
- Exact Method Body:
// Use a java Stream.Builder to save the results to a Java Stream.
// Streams are easily converted to arrays.
Stream.Builder<SearchResult> resultsBuilder = Stream.builder();

URL splashQuery = new URL(SPLASH_URL + query.toString());

// Download the HTML, and save it to a java.util.Vector (like an array)
Vector<HTMLNode> v = HTMLPage.getPageTokens(splashQuery, false, "out.html", null, null);

// Create a "百度 Results Iterator" - each result is wrapped in an HTML Element
// that looks like: <DIV CLASS="result"> ... </DIV>  (or CLASS="result-op")
HNLIInclusive resultsIter = InnerTagInclusiveIterator.get
    (v, "div", "class", TextComparitor.C_OR, "result", "result-op");

while (resultsIter.hasNext())
{
    // Get the <DIV CLASS="result"> ... </DIV> contents.
    Vector<HTMLNode> result = resultsIter.next();

    // The first anchor <A HREF=...> will contain the link for this search-result.
    Vector<HTMLNode> firstLink = TagNodeGetInclusive.first(result, "a");

    String url   = ((TagNode) firstLink.elementAt(0)).AV("href").trim();
    String title = Util.textNodesString(firstLink).trim();

    // Save the results in a Java Stream, using Stream.Builder.
    Stream.Builder<SearchResult> subResultsBuilder = Stream.builder();

    // To get the list of search-result sub-links, retrieve all links that are
    // labelled <DIV CLASS="c-row"> ... </DIV>
    HNLIInclusive subLinksIter = InnerTagInclusiveIterator.get
        (result, "div", "class", TextComparitor.C, "c-row");

    // Iterate through any "Sub Links."  Again, a "Sub Link" is hereby being defined
    // as a search result for a particular web-site that would be able to produce
    // many / numerous additional links.  Often times these additional links are more
    // useful than the primary link that was returned.
    while (subLinksIter.hasNext())
    {
        Vector<HTMLNode> div = subLinksIter.next();

        // System.out.println(Util.pageToString(div) + "\n********* RT **********************\n");

        // The link / search-result itself is the first HTML Anchor Element
        // (<A HREF=...> ... </A>)
        DotPair subLink = TagNodeFindInclusive.first(div, "A");

        if (subLink == null) continue;

        // Get the URL
        String subLinkURL = ((TagNode) div.elementAt(subLink.start)).AV("href").trim();

        // The first URL returned is just the one we have already retrieved.
        if (subLinkURL.equalsIgnoreCase(url)) continue;

        // Get the text that is wrapped inside the <A HREF=..> "this-text" </A>
        // HTML Element.  Util.textNodesString(...) simply removes all TagNodes, and
        // appends the TextNodes together.
        String subLinkTitle = Util.textNodesString(div, subLink).trim();

        subResultsBuilder.accept(new SearchResult(subLinkURL, subLinkTitle));
    }

    // Use Java Streams to build the SearchResult[] Array.  Call the
    // Stream.Builder.build() method, and then call the Stream.toArray(...) method.
    SearchResult[] subResults = subResultsBuilder.build().toArray(SearchResult[]::new);

    SearchResult sr = new SearchResult
        (url, title, (subResults.length > 0) ? subResults : null);

    resultsBuilder.accept(sr);
}

// Use java's Stream.Builder.build() to create the Stream, then easily convert
// to an array.
SearchResult[] srArr = resultsBuilder.build().toArray(SearchResult[]::new);

// If the log is not null, print out the results.
if (log != null)
    for (SearchResult sr : srArr)
        log.append(sr.toString() + '\n');

// IMPORTANT NOTE: This code will retrieve the next available 10 PAGES of
// SEARCH RESULTS as a URL.
DotPair nextResultsDIV = InnerTagFindInclusive.first
    (v, "div", "id", TextComparitor.EQ, "page");

// Use these criteria specifiers to find the HTML '<A HREF...>' NEXT-PAGE in
// SEARCH-RESULTS links...
Vector<DotPair> nextPages = InnerTagFindInclusive.all
    (v, nextResultsDIV.start, nextResultsDIV.end, "a", "href");

URL[] urlArr = new URL[nextPages.size()];

/*
log.append(
    "nextResultsDIV.size(): " + nextResultsDIV.size() + '\n' +
    "urlArr.length: "         + urlArr.length         + '\n'
);
*/

// The URL for each of the next pages.  A programmer may expand this
// answer by investigating more of these links in a loop.
for (int i=0; i < nextPages.size(); i++)
{
    DotPair link = nextPages.elementAt(i);

    String href = ((TagNode) v.elementAt(link.start)).AV("href");

    urlArr[i] = Links.resolve(href, query);

    // log.append(urlArr[i].toString() + '\n');
}

return new Ret2<SearchResult[], URL[]>(srArr, urlArr);
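A short pagination sketch (the keywords are arbitrary examples; the Splash Server must be running, and the relevant Torello imports are assumed): the URLs in Ret2.b returned by one call may be fed back into this method to retrieve further pages of results:

// Page 1: an ordinary keyword search
Ret2<BaiDuQuery.SearchResult[], URL[]> page1 =
    BaiDuQuery.query(null, "上海", "新闻");

// Page 2: follow the first "next page" link that page 1 provided
if (page1.b.length > 0)
{
    Ret2<BaiDuQuery.SearchResult[], URL[]> page2 =
        BaiDuQuery.query(null, page1.b[0]);

    for (BaiDuQuery.SearchResult sr : page2.a)
        System.out.println(sr);
}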