Package Torello.HTML

Class SplashBridge


  • public class SplashBridge
    extends java.lang.Object
    Demonstrates using 'Splash', which is one of several ways to execute the Java-Script on a Web-Page before that page is parsed.

    This class is more like the MIME class in the Java Package, because it is really only here to provide a good example of contacting an already-up-and-running Splash Server. NOTE: The MIME class just holds lists of software-tools, all of which were once very useful, but the class itself doesn't do anything at all. This class, likewise, does nothing other than download a copy of the Wikipedia Page for Christopher Columbus. Since these JavaDoc Pages contain the source code for the method bodies, please review the method body below to see how to arrange the proper request into a URL when polling a Splash HTTP Server.

    FIRST: Running the development on an instance of Google Cloud Shell, getting a Splash Server up and running seemed to work on the first try. The commands for starting Splash are documented on their main documentation web-page: https://splash.readthedocs.io/en/stable/install.html. I typed the two commands expected - because Google already has the required "docker" program on their system - and the HTTP Server started right up.

    SECOND: Splash is claiming to be a more light-weight alternative to the Selenium Package for both polling a web-server and executing and running any Java-Script methods available on the page. The API that they export seems to be in the "Lua" language, HOWEVER since making calls to the server only requires an HTTP Connection AND SINCE the responses that a running Splash HTTP Server will return are just standard HTTP HTML responses, including an example here in this package seems reasonable. Making calls to an HTTP server is handled very well in Java already, and this package is great at parsing HTML results.

    FINALLY: Not being a user of Selenium or Splash for intricate or complex Java-Script interactions with a web-page, there is no formal explanation of what is "buggy" about this external software tool. Generally, when scraping foreign news sources, there is no Java-Script at all to worry about! However, there have been quite a few times when gathering stories, from Wikipedia for example, the web-scrape was not returning the same output that was sent to a desktop web-browser. This 'Splash API' appears to be able to wait for all possible Java-Script functions to execute before returning HTML to Java - which warrants a "Bridge Class" in this package. Actually making calls to individual methods on the page will require some knowledge of the Lua Programming Language, or changing to Selenium altogether. However, since this is mostly a REST/JSON API, making API calls to the HTTP Server - even when requesting Lua Scripts to execute should not be difficult from a Java Class, if the Splash Documentation is correct.
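The point above about sending Lua Scripts from a Java Class can be sketched as follows. This is only a sketch under assumptions: it relies on the `execute` endpoint and the `lua_source` argument described in the Splash documentation, the class and method names are hypothetical and not part of this package, and none of it has been checked against a running server here. It only builds the request URL, so it runs without Splash.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of Torello.HTML.
public class SplashLuaSketch
{
    // Assumption: Splash is listening on the default local port 8050.
    static final String EXECUTE_ENDPOINT = "http://localhost:8050/execute";

    // A small Lua script in the shape shown in the Splash documentation:
    // load the page, wait briefly, return the rendered HTML.
    static final String LUA_SOURCE =
        "function main(splash, args)\n" +
        "  assert(splash:go(args.url))\n" +
        "  assert(splash:wait(0.5))\n" +
        "  return splash:html()\n" +
        "end";

    // Percent-encodes one query-string value using UTF-8.
    static String enc(String s)
    {
        try { return URLEncoder.encode(s, "UTF-8"); }
        catch (UnsupportedEncodingException e) { throw new IllegalStateException(e); }
    }

    // Builds a GET request URL that passes both the target page and the script.
    static String executeUrl(String targetUrl, String luaSource)
    {
        return EXECUTE_ENDPOINT + "?url=" + enc(targetUrl) + "&lua_source=" + enc(luaSource);
    }

    public static void main(String[] args)
    {
        System.out.println(executeUrl("https://en.wikipedia.org/", LUA_SOURCE));
    }
}
```

The resulting String could then be wrapped in a URL and handed to the same scrape routines used elsewhere in this class, assuming the server answers as documented.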


    • Field Summary

      Fields 
      Modifier and Type Field
      static String SPLASH_URL
    • Method Summary

      All Methods Static Methods Concrete Methods 
      Modifier and Type Method
      static void example01()
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • SPLASH_URL

        public static final java.lang.String SPLASH_URL
        Once the Splash HTTP Server is running (which requires the Docker loading and installation tool), all one has to do is prepend this String to any URL, and the Splash Script Executor will be invoked on the HTML and Script that is received from that URL.

        Example:
         String   myURL               = "https://cars.com";
         URL      withSplashServerURL = new URL(SplashBridge.SPLASH_URL +  myURL);
        
         // Here, just use the standard HTML scrape and parsing routines to retrieve the HTML
         // from the URL 'myURL'.  Splash will execute any 'dynamic HTML' that is loaded via the
         // standard script libraries like AJAX, JSON, React-JS, jQuery, or Angular.
        
         Vector<HTMLNode> html = HTMLPage.getPageTokens(withSplashServerURL, false);
         
         // NOTE: The above invocation will not call the "www.cars.com" server, BUT RATHER, will
         //       ask the HTTP Server running on the local host as a PROXY to retrieve the HTML
         //       from "www.cars.com".  Before returning that HTML, the local proxy server will also
         //       execute the dynamic-loading script that is present on the main page of "cars.com"
         // 
         // ALSO: There are other libraries that perform this type of work: Selenium, and Android
         //       class WebView.
        
        See Also:
        Constant Field Values
        Code:
        Exact Field Declaration Expression:
        public static final String SPLASH_URL = "http://localhost:8050/render.html?url=";
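One caveat when prepending this String: the target URL becomes the value of a query-string argument, so any `?` or `&` inside it is ambiguous unless it is percent-encoded. A target such as the Wikipedia address used in example01, which carries `&oldid=...`, would presumably have that part read as a separate argument to Splash rather than as part of the page address. The sketch below encodes the target first. The class and method names are hypothetical, and the constant is copied from the declaration above only so the example compiles on its own.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of Torello.HTML.
public class SplashUrlSketch
{
    // Same value as SplashBridge.SPLASH_URL, duplicated here for self-containment.
    static final String SPLASH_URL = "http://localhost:8050/render.html?url=";

    // Percent-encodes the target so its own '?' and '&' stay inside the 'url' argument.
    static String viaSplash(String targetUrl)
    {
        try { return SPLASH_URL + URLEncoder.encode(targetUrl, "UTF-8"); }
        catch (UnsupportedEncodingException e) { throw new IllegalStateException(e); }
    }

    public static void main(String[] args)
    {
        String target =
            "https://en.wikipedia.org/w/index.php?title=Christopher_Columbus&oldid=924321156";

        System.out.println(viaSplash(target));
    }
}
```

For a target with no query string of its own, such as "https://cars.com", plain concatenation and the encoded form should both reach the same page.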
        
    • Method Detail

      • example01

        public static void example01()
                              throws java.io.IOException
        The only purpose of this method is to provide an example of scraping a page that is Java-Script heavy using the "Splash JavaScript API". Splash plays a role comparable to the Selenium Headless Web Browser. Once an instance of a "Splash HTTP Server" is running on your local host, you may make calls to the local host on port 8050 and request that the server visit a website, execute the Java-Script, and return the resulting HTML to you as a String.

        STARTING SPLASH HTTP SERVER: I was able to start a splash server on my first attempt on a Google Cloud Server Shell Instance. I just typed the commands listed on their website, and it started up on the spot. As explained, the Splash JavaScript Execution Engine is written in Python, but its interface is an HTTP Server that runs on your local machine - or another machine in the office. Making calls to the server was as simple as making a URL, and calling the server like it was any other website. See the method body at the end.

        These are the commands that I called to start an instance of the HTTP Java-Script Execution Engine Web-Server on my Google Cloud Shell. It worked the first time I tried it. The documentation claims that standard HTTP requests are mostly how it works, but it also utilizes some language called "Lua" as well. I have polled it using a standard Java URL connection. Perhaps I could write Lua Scripts and use Java to send them to the server, as well. The standard foreign-news websites I have parsed and searched do not require JavaScript to be executed in order to retrieve their content.
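Polling the server with a standard Java URL connection, as mentioned above, might look like the sketch below. It is an illustration only: the class and method names are hypothetical, the response is assumed to be UTF-8 text, and the `timeout` and `wait` values are simply the ones that appear in the method body of this class.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Torello.HTML.
public class SplashFetchSketch
{
    // Builds the render.html request, with the target percent-encoded.
    static String requestUrl(String targetUrl, int timeoutSeconds, double waitSeconds)
    {
        try
        {
            return "http://localhost:8050/render.html?url=" +
                URLEncoder.encode(targetUrl, "UTF-8") +
                "&timeout=" + timeoutSeconds + "&wait=" + waitSeconds;
        }
        catch (UnsupportedEncodingException e) { throw new IllegalStateException(e); }
    }

    // Opens the connection and reads the whole response body into a String.
    static String fetchRendered(String targetUrl) throws IOException
    {
        URL url = new URL(requestUrl(targetUrl, 10, 4.0));
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setConnectTimeout(5000);    // milliseconds
        con.setReadTimeout(15000);      // longer than the Splash timeout of 10 seconds

        try (InputStream in = con.getInputStream())
        {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) buf.write(chunk, 0, n);
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        }
        finally { con.disconnect(); }
    }

    public static void main(String[] args)
    {
        try
        {
            String html = fetchRendered("https://en.wikipedia.org/wiki/Christopher_Columbus");
            System.out.println("Received " + html.length() + " characters");
        }
        catch (IOException e)
        {
            System.out.println("Splash not reachable on localhost:8050: " + e.getMessage());
        }
    }
}
```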

        UNIX or DOS Shell Command:
        Install Docker. Make sure Docker version >= 17 is installed.

        Pull the image:
          $ sudo docker pull scrapinghub/splash

        Start the container:
          $ sudo docker run -it -p 8050:8050 --rm scrapinghub/splash
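After the container starts, it can be convenient to confirm from Java that the server is answering before any scrape is attempted. The sketch below assumes the `_ping` endpoint described in the Splash documentation and treats an HTTP 200 as "up". The class and method names are hypothetical, and the endpoint has not been verified here against a running server.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper, not part of Torello.HTML.
public class SplashPingSketch
{
    // Assumption: Splash exposes a health-check at "/_ping".
    static String pingUrl(String host, int port)
    {
        return "http://" + host + ":" + port + "/_ping";
    }

    // Returns true only if the server answered with HTTP 200 within two seconds.
    static boolean isUp(String host, int port)
    {
        try
        {
            HttpURLConnection con =
                (HttpURLConnection) new URL(pingUrl(host, port)).openConnection();

            con.setConnectTimeout(2000);
            con.setReadTimeout(2000);
            con.setRequestMethod("GET");

            int status = con.getResponseCode();
            con.disconnect();
            return status == 200;
        }
        catch (IOException e)
        {
            // Connection refused, timeout, or a malformed address: treat all as "not up".
            return false;
        }
    }

    public static void main(String[] args)
    {
        System.out.println("Splash up: " + isUp("localhost", 8050));
    }
}
```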


        Here is an (approximate) commentary about how to run the Splash HTTP Server on a Windows Instance:

        Is there a Microsoft Windows version of the Splash HTTP Server (May, 2016)?

        REQUEST:

        Can't find any mentioning in docs; And bin/ also appears not meant for Windows.

        RESPONSE:

        Splash should work fine in Microsoft Windows if executed in a Docker Container

        Splash API install instructions should be the same, once the Docker Installer is installed.

        See: Docker Installer Installation Instructions for info on how to install Docker onto Windows.

        Throws:
        java.io.IOException - If there are any HTTP errors when downloading or processing the HTML.
        Code:
        Exact Method Body:
         // Call the splash-bridge running on local-host @ port 8050
         // The "wait" parameter means it will wait up to four seconds to run java-script AJAX
         // data-retrieval tasks that are on the page.
        
         String urlStr =
             "http://localhost:8050/render.html?url=" + 
             "https://en.wikipedia.org/w/index.php?title=Christopher_Columbus&oldid=924321156" +
                 "&timeout=10&wait=4.0";
        
         URL url = new URL(urlStr);
        
         // This will just use the standard Java HTTP URLConnection class to connect to the exact
         // same page.
        
         String urlStr2 = "https://en.wikipedia.org/w/index.php?title=Christopher_Columbus&oldid=924321156";
        
         URL url2 = new URL(urlStr2);
        
         // Download both versions.  This version is contacting a Splash Server on a local host
         // running @ port 8050
         // NOTE: This writes the HTML to a Flat-File on the File-System.
        
         Vector<HTMLNode> v = HTMLPage.getPageTokens(url, false);
        
         FileRW.writeFile(Util.pageToString(v), "cc.html");
        
         // This version is contacting Wikipedia.com, and ignoring any possible AJAX or Java-Script
         // calls - script calls of any kind are being ignored by this version.
         // NOTE: This writes the HTML to a Flat-File on the File-System.
        
         Vector<HTMLNode> v2 = HTMLPage.getPageTokens(url2, false);
        
         FileRW.writeFile(Util.pageToString(v2), "cc2.html");
        
         // FileOutput Size: Version 1: 650737 Nov  4 18:28 cc.html
         // FileOutput Size: Version 2: 493879 Nov  4 18:28 cc2.html
         // RESULTS: Clearly there is quite a bit of downloaded data from AJAX & Splash
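The two file sizes listed in the comments above can be compared with a few lines of arithmetic. The sketch below uses only those two numbers; the class and method names are hypothetical. Note that a size difference by itself does not show which content was added by script execution, so this is a rough indicator at best.

```java
// Hypothetical helper, not part of Torello.HTML.
public class SplashSizeDeltaSketch
{
    // Extra bytes in the Splash-rendered file compared with the direct download.
    static long extraBytes(long renderedBytes, long rawBytes)
    {
        return renderedBytes - rawBytes;
    }

    // The same difference, expressed as a percentage of the direct download.
    static double percentLarger(long renderedBytes, long rawBytes)
    {
        return 100.0 * (renderedBytes - rawBytes) / rawBytes;
    }

    public static void main(String[] args)
    {
        long rendered = 650737L;    // cc.html, retrieved through Splash
        long raw      = 493879L;    // cc2.html, retrieved directly

        System.out.println(extraBytes(rendered, raw) + " extra bytes");             // 156858 extra bytes
        System.out.printf("%.1f%% larger%n", percentLarger(rendered, raw));
    }
}
```

With the sizes above, the Splash version is 156858 bytes larger, which is a little under 32 percent.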