public class SplashBridge extends java.lang.Object
Demonstrates using 'Splash,' which is one of many ways to execute the Java-Script on Web-Pages, before those pages are parsed.
This class is more like the MIME class in the Java Package, because it is really only here to provide a good example of contacting an already-up-and-running Splash Server. NOTE: The MIME class contains just lists of software-tools, all of which were once very useful, but the class itself doesn't do anything at all. This class, likewise, does nothing other than download a copy of the Wikipedia Page for Christopher Columbus. Since these JavaDoc Pages all contain the source code for the method bodies, please review how to arrange the proper request into a URL when polling a Splash HTTP Server.
Running the utility on an instance of Google Cloud Shell, and bringing a Splash Server up and running, seemed to work on the first try. The commands for starting Splash are documented on their main documentation web-page:
I typed the two commands that were expected, and since Google already has the required "docker" program on their system, the HTTP Server seemed to start right up.
Splash claims to be a more light-weight alternative to the Selenium-Package for both polling a web-server and executing any Java-Script methods available on the page. The API that they export seems to be in the "Lua" language; however, since making calls to the server only requires an HTTP Connection, and since the responses that a Splash HTTP Server returns are just standard HTTP HTML responses, "Lua" is not required reading!
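The point about needing only an HTTP Connection can be sketched with the JDK's own java.net.http client (Java 11 or later). This is a minimal sketch, not the code used by this class: the class name SplashHttpSketch and its two methods are hypothetical, and it assumes a Splash Server is already listening on localhost port 8050 as described on this page.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper, for illustration only.
final class SplashHttpSketch {
    // Same prefix as the SPLASH_URL field documented on this page.
    static final String SPLASH_URL = "http://localhost:8050/render.html?url=";

    // Builds a plain HTTP GET request aimed at the local Splash Server.
    static HttpRequest buildRequest(String targetUrl) {
        return HttpRequest.newBuilder(URI.create(SPLASH_URL + targetUrl))
                .timeout(Duration.ofSeconds(30))
                .GET()
                .build();
    }

    // Sends the request and returns the response body, which is ordinary HTML text.
    static String fetchRenderedHtml(String targetUrl)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildRequest(targetUrl), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Splash returned HTTP " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            // No target given: only show the request that would be sent.
            System.out.println(buildRequest("https://example.com").uri());
            return;
        }
        // Requires a running Splash Server on localhost:8050.
        System.out.println(fetchRenderedHtml(args[0]).length() + " characters of HTML");
    }
}
```

Nothing in the sketch is specific to Lua or to Splash beyond the URL prefix, which is why an ordinary Java HTTP client is sufficient.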
Including an example here in this package seems reasonable. Making calls to an HTTP server is handled very well in Java already, and this package is great at parsing HTML results.
Since the author is not a user of Selenium or Splash for intricate or complex Java-Script interactions with a web-page, there is no formal explanation here of what (if anything) is "buggy" about this external software tool. Generally, when scraping foreign news sources, there is no Java-Script at all to worry about! Splash seems to work fine after installation.
There have been quite a few times when gathering stories, from Wikipedia for example, that a web-scrape did not return the same output that was sent to a desktop web-browser. This 'Splash API' appears to be able to wait for all possible Java-Script functions to execute before returning HTML to Java, which warrants a "Bridge Class" in this package. This bridge doesn't return the "raw HTML" sent by the server; it returns the Java-Script Post-Processed HTML.
Actually making calls to individual methods on the page may require some knowledge of the Lua Programming Language, or changing to Selenium altogether. However, since this is mostly a REST/JSON API, making API calls to the HTTP Server, even when requesting Lua Scripts to execute, should not be difficult from a Java Class, if the Splash Documentation is correct.
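As a rough illustration of requesting a Lua Script from Java, the sketch below prepares a JSON POST for Splash's "execute" endpoint, passing the script in the "lua_source" argument. It is a sketch under assumptions, not part of this package: the class SplashLuaSketch and its helper methods are hypothetical, the Lua script is a minimal one in the style of the Splash Documentation, and the request has not been run against a live server here.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper, for illustration only.
final class SplashLuaSketch {
    // Assumed endpoint of a Splash Server running on the local host.
    static final String EXECUTE_ENDPOINT = "http://localhost:8050/execute";

    // Minimal Lua script: load the page, wait half a second, return the HTML.
    static final String LUA_SOURCE = String.join("\n",
            "function main(splash, args)",
            "  assert(splash:go(args.url))",
            "  assert(splash:wait(0.5))",
            "  return splash:html()",
            "end");

    // Quotes a Java String as a JSON string literal.
    static String jsonString(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    // JSON body carrying the script and the page it should load.
    static String requestBody(String targetUrl) {
        return "{\"lua_source\":" + jsonString(LUA_SOURCE)
                + ",\"url\":" + jsonString(targetUrl) + "}";
    }

    // POST request; send it with java.net.http.HttpClient when a server is running.
    static HttpRequest buildRequest(String targetUrl) {
        return HttpRequest.newBuilder(URI.create(EXECUTE_ENDPOINT))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(requestBody(targetUrl)))
                .build();
    }

    public static void main(String[] args) {
        System.out.println(requestBody("https://example.com"));
    }
}
```

Only the script text itself is Lua; the surrounding request is plain JSON over HTTP, so the Java side stays the same when the script changes.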
Fields Modifier and Type Field
public static final java.lang.String SPLASH_URL
Once the Splash HTTP Server is running (which requires the Docker loading and installation tool), all one has to do is prepend this URL, and the Splash Script Executor will be invoked on the HTML and Script that is received from that
String myURL = "https://cars.com";
URL withSplashServerURL = new URL(SplashBridge.SPLASH_URL + myURL);

// Here, just use the standard HTML scrape and parsing routines to retrieve the HTML
// from the URL 'myURL'. Splash will execute any 'dynamic HTML' that is loaded via the
// standard script libraries like AJAX, JSON, React-JS, jQuery, or Angular.
Vector<HTMLNode> html = HTMLPage.getPageTokens(withSplashServerURL, false);

// NOTE: The above invocation will not call the "www.cars.com" server, BUT RATHER, will
//       ask the HTTP Server running on the local host as a PROXY to retrieve the HTML
//       from "www.cars.com". Before returning that HTML, the local proxy server will also
//       execute the dynamic-loading script that is present on the main page of "cars.com"
//
// ALSO: There are other libraries that perform this type of work: Selenium, and Android
//       class WebView.
- See Also:
- Constant Field Values
- Exact Field Declaration Expression:
public static final String SPLASH_URL = "http://localhost:8050/render.html?url=";
Starting the Splash HTTP-Server:
UNIX or DOS Shell Command:
Install Docker. Make sure Docker version >= 17 is installed.
Pull the image:
$ sudo docker pull scrapinghub/splash
Start the container:
$ sudo docker run -it -p 8050:8050 --rm scrapinghub/splash
Here is an (approximate) commentary about how to run the Splash HTTP Server on a
Is there a Microsoft Windows version of the Splash HTTP Server (May, 2016)?
Can't find any mentioning in docs; And
bin/ also appears not meant for
Splash should work fine in Microsoft Windows if executed in a
Splash API install instructions should be the same, once the Docker Installer is installed.
Docker Installer Installation Instructions for info on how to install
java.io.IOException - If there are any HTTP errors when downloading or processing the HTML.
- Exact Method Body:
// Call the splash-bridge running on local-host @ port 8050
// The "wait" parameter tells Splash to wait four seconds after the page loads, so that
// java-script AJAX data-retrieval tasks on the page have time to run. The "timeout"
// parameter limits the whole render to ten seconds.
String urlStr = "http://localhost:8050/render.html?url=" +
    "https://en.wikipedia.org/w/index.php?title=Christopher_Columbus&oldid=924321156" +
    "&timeout=10&wait=4.0";
URL url = new URL(urlStr);

// This will just use the standard Java HTTP URLConnection class to connect to the exact
// same page.
String urlStr2 = "https://en.wikipedia.org/w/index.php?title=Christopher_Columbus&oldid=924321156";
URL url2 = new URL(urlStr2);

// Download both versions. This version is contacting a Splash Server on a local host
// running @ port 8050
// NOTE: This writes the HTML to a Flat-File on the File-System.
Vector<HTMLNode> v = HTMLPage.getPageTokens(url, false);
FileRW.writeFile(Util.pageToString(v), "cc.html");

// This version is contacting Wikipedia directly, and ignoring any possible AJAX or
// Java-Script calls - script calls of any kind are being ignored by this version.
// NOTE: This writes the HTML to a Flat-File on the File-System.
Vector<HTMLNode> v2 = HTMLPage.getPageTokens(url2, false);
FileRW.writeFile(Util.pageToString(v2), "cc2.html");

// FileOutput Size: Version 1: 650737 Nov 4 18:28 cc.html
// FileOutput Size: Version 2: 493879 Nov 4 18:28 cc2.html
// RESULTS: Clearly there is quite a bit of downloaded data from AJAX & Splash
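One caveat about the method body above: the Wikipedia URL is appended unencoded, so under ordinary query-string parsing the "&oldid=924321156" part reads as a parameter of the Splash request rather than as part of the Wikipedia URL. Whether this affected the file sizes reported above has not been checked. A minimal sketch of percent-encoding the target first is shown here; the class SplashRenderUrl and its build method are hypothetical names, and the code needs Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class SplashRenderUrl {
    // Endpoint taken from the SPLASH_URL field documented on this page.
    static final String RENDER_ENDPOINT = "http://localhost:8050/render.html";

    // Percent-encodes the target so its own '?' and '&' stay inside the "url" value.
    static String build(String targetUrl, double waitSeconds, int timeoutSeconds) {
        String encoded = URLEncoder.encode(targetUrl, StandardCharsets.UTF_8);
        return RENDER_ENDPOINT
                + "?url=" + encoded
                + "&wait=" + waitSeconds
                + "&timeout=" + timeoutSeconds;
    }

    public static void main(String[] args) {
        String target =
            "https://en.wikipedia.org/w/index.php?title=Christopher_Columbus&oldid=924321156";
        System.out.println(build(target, 4.0, 10));
    }
}
```

The resulting String can be handed to new URL(...) and then to HTMLPage.getPageTokens exactly as in the method body, with the "wait" and "timeout" values unchanged.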