Class Example01


  • public class Example01
    extends java.lang.Object
    An example of this package's utility. This class initiates a connection to a headless Chrome instance and visits a page.

    Viewing the Output:
    The text output which is generated by this Example - the text printed to the Terminal Output - may be viewed in the link below:

    Example01.out.html

    Installing Chrome in GCP Cloud Shell:


    These are the commands that I type inside of a GCP (Google Cloud Platform) Debian Terminal/Shell to make sure that a Chrome Headless Browser is working. ChatGPT explained it to me, and wrote me a shell script to do the installation. I only do development on cloud servers, rather than local machines. I use laptops way too much.

    🔑 If you are programming on your own computer, you likely already have a CDP-compatible web browser installed. If so, skip the installation step completely.

    ✔️ If you need to install Chrome, here's the script that A.I. wrote for me in the summer of '25. It still works great in GCP.


    UNIX or DOS Shell Command:
    # Update package list
    sudo apt-get update

    # Install just the essentials for headless Chrome
    sudo apt-get install -y \
        fonts-liberation \
        libnss3 \
        libatk1.0-0 \
        libxss1 \
        libgdk-pixbuf2.0-0 \
        libgtk-3-0 \
        libasound2t64 \
        libnspr4 \
        xdg-utils \
        wget \
        ca-certificates

    # Download Chrome manually
    wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb

    # Install Chrome and auto-fix dependencies
    sudo dpkg -i google-chrome-stable_current_amd64.deb || sudo apt-get -fy install


    The above shell commands, again, were generated by ChatGPT on July 11th, 2025. They installed a working copy of Chrome inside my Linux instance without any errors. The output generated by the above commands is reproduced here.


    Starting Chrome in the Cloud:

    Once Google Chrome has been installed in your GCP Cloud Shell environment, you can start a headless Chrome instance that continues running in the background, even if you hit ^C, close your terminal, or go refill your drink at Starbucks.

    This isn't the same as launching a full Compute Engine instance; you're just spinning up a background terminal process inside your ephemeral Cloud Shell session. The process will live until you shut down your shell, or until the session times out.

    To launch Chrome headlessly in a way that ignores ^C and keyboard input:

    UNIX or DOS Shell Command:
    nohup google-chrome --headless --disable-gpu --remote-debugging-port=9222 \
        --no-sandbox --disable-dev-shm-usage > /dev/null 2>&1 & disown


    This command uses:

    • nohup - Makes the process ignore the hangup signal (SIGHUP), so it keeps running when the terminal closes.
    • & - Puts the Chrome process in the background immediately, so a ^C typed at the prompt is not delivered to it.
    • disown - Removes the process from the shell's job table, so the shell does not signal it when the shell exits.
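
    The same flags can also be passed from Java. The following is a minimal sketch, assuming google-chrome is on the PATH; the class and method names are hypothetical and are not part of this library. It discards Chrome's output in the same way as '> /dev/null 2>&1', but it does not replicate nohup or disown.

```java
import java.io.IOException;
import java.util.List;

class ChromeLaunchSketch
{
    // Same flags as the shell command above; the port is a parameter for illustration
    static List<String> chromeCommand(int port)
    {
        return List.of(
            "google-chrome", "--headless", "--disable-gpu",
            "--remote-debugging-port=" + port,
            "--no-sandbox", "--disable-dev-shm-usage"
        );
    }

    // Starts Chrome and discards its standard output and standard error
    static Process launch(int port) throws IOException
    {
        return new ProcessBuilder(chromeCommand(port))
            .redirectErrorStream(true)
            .redirectOutput(ProcessBuilder.Redirect.DISCARD)
            .start();
    }
}
```

    Keeping the argument list in its own method makes it easy to inspect or log the exact command before anything is started.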


    To check if Chrome is running later:

    UNIX or DOS Shell Command:
    ps aux | grep '[g]oogle-chrome'

    ## The above command should produce output such as:
    narrati+ 5916 5.9 1.5 34396956 249664 pts/3 S<l 20:20 0:01 /usr/bin/google-chrome ...
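
    A Java program can make a similar check by testing whether anything is listening on the debugging port. This is a minimal sketch using only the JDK; the class name and the 500 millisecond timeout are arbitrary choices, and an open port only shows that some process is listening there, not that it is Chrome.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class DebugPortCheck
{
    // Returns true if a TCP connection to host:port succeeds within the timeout
    static boolean isPortOpen(String host, int port, int timeoutMillis)
    {
        try (Socket socket = new Socket())
        {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        }
        catch (IOException e)
        {
            // Connection refused, timed out, or host unreachable
            return false;
        }
    }

    public static void main(String[] args)
    {
        System.out.println("Port 9222 open: " + isPortOpen("localhost", 9222, 500));
    }
}
```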


    To kill the headless Chrome instance when you're done:

    UNIX or DOS Shell Command:
    pkill -f 'google-chrome.*--headless'

    ## To kill by Process-ID
    kill <PID>


    Page originally drafted by ChatGPT on July 11th, 2025.
    Edited and formatted by Ralph Torello for use in the Java HTML Library documentation.



    • Field Detail

      • samAltmanURL

        protected static final java.lang.String samAltmanURL
        The URL that is being scraped in this example
        See Also:
        Constant Field Values
        Code:
        Exact Field Declaration Expression:
         protected static final String samAltmanURL = "https://en.wikipedia.org/wiki/Sam_Altman";
        
    • Method Detail

      • main

        public static void main​(java.lang.String[] argv)
                         throws java.lang.Exception
        This class is intended to be invoked from the Command Line.
        Throws:
        java.lang.Exception
        Code:
        Exact Method Body:
         // Opening a WebSocket Browser-Connection to the currently running Chrome-Instance
         final WebSocketSender bws = STEP_01_openBrowserWebSocket();
        
         // Close any currently opened pages / tabs inside the browser
         STEP_02_closeAllPages(bws);
        
         // Open a Browser-Page (using 'bws') for reading Sam Altman's Wikipedia Profile
         final String targetID = STEP_03_openSamAltmanPage(bws);
        
         // Create / Build a WebSocket-Connection object to the newly opened Sam Altman Page.
         final WebSocketSender pws = STEP_04_getPageWebSocket(targetID);
        
         // Execute some Java-Script so that the scrape code may run
         final String html = STEP_05_runJavaScript(pws);
        
         // Print the Image-URL's, retrieve those URL's too
         final String[] imgURLs = STEP_06_extractImageURLs(html);
        
         // Download the Images into a download folder
         STEP_07_downloadImages(imgURLs);
        
         bws.disconnect();
         pws.disconnect();
        
      • STEP_01_openBrowserWebSocket

        protected static WebSocketSender STEP_01_openBrowserWebSocket
                    ()
                throws java.lang.Exception
        
        This method demonstrates the first step in connecting to Chrome via the Chrome DevTools Protocol (CDP). It connects to a headless instance of Chrome that is already running with remote debugging enabled, and establishes the primary WebSocket connection that will be used for all subsequent CDP communication. This connection targets the browser-level control endpoint, not a tab-specific page socket.

        Chrome must have been started with the --remote-debugging-port=9222 flag, as described under "Starting Chrome in the Cloud" above. The method queries the /json/version endpoint to retrieve WebSocket metadata, uses that metadata to construct a WebSocketSender for JSON request-response communication with Chrome, and then pauses for one second before returning.

        If you're trying to automate or control browser behavior from Java, this is where it all begins: getting a working WebSocket connection to the Chrome backend.
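
        To make the /json/version lookup concrete, here is a minimal sketch that uses only the JDK's HttpClient. It is not how BrowserConn is implemented; the class and method names are hypothetical, and the regular expression is a stand-in for a real JSON parser.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class VersionEndpointSketch
{
    // Matches the "webSocketDebuggerUrl" property of the /json/version response
    private static final Pattern WS_URL =
        Pattern.compile("\"webSocketDebuggerUrl\"\\s*:\\s*\"([^\"]+)\"");

    // Fetches the raw JSON text from a Chrome instance listening on the given port
    static String fetchVersionJson(int port) throws Exception
    {
        HttpRequest request = HttpRequest
            .newBuilder(URI.create("http://localhost:" + port + "/json/version"))
            .GET()
            .build();

        return HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString())
            .body();
    }

    // Pulls the browser-level WebSocket URL out of that JSON text
    static String extractWebSocketUrl(String json)
    {
        Matcher m = WS_URL.matcher(json);
        if (!m.find()) throw new IllegalArgumentException("No webSocketDebuggerUrl found");
        return m.group(1);
    }
}
```

        With Chrome running, extractWebSocketUrl(fetchVersionJson(9222)) would yield the ws:// address of the browser-level endpoint.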
        Throws:
        java.lang.Exception
        Code:
        Exact Method Body:
         Printing.notice("Opening a WebSocket Browser Connection...");
        
         final BrowserConn browserConn = BrowserConn.getBrowserConn(9222, false);
        
         System.out.println(
             '\n' + BCYAN + "Example01.java: " + RESET +
             BRED + "Opened Browser Connection:\n" + RESET + browserConn.toString()
         );
        
         final WebSocketSender bws = browserConn.createSender(Example01.connRec);
        
         // Chat-GPT once suggested this line. I just haven't removed it.  It's not hurting anyone!
         Thread.sleep(1000);
        
         return bws;
        
      • STEP_02_closeAllPages

        protected static void STEP_02_closeAllPages​(WebSocketSender bws)
                                             throws java.lang.Exception
        This step closes all existing pages (i.e., browser tabs) currently open in the Chrome instance. CDP allows enumeration of all tabs via a call to /json/list, and each tab provides a targetId property that can be passed to Target.closeTarget.

        The method calls Target.getTargets() to obtain all open targets, then iterates through them and sends a Target.closeTarget(tID) command for each one. The isSamAltman predicate in the method body is left over from an earlier version that closed only the Wikipedia tabs, and it is currently unused. Closing everything is useful to start from a clean browser state before performing automation.

        If Chrome was already running with many tabs open, this call helps ensure that subsequent tab-based automation starts in a predictable environment.
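
        If only some targets should be closed, for example only those of type "page", a predicate can be applied before the close loop. The sketch below shows the idea with a hypothetical Info class standing in for Target.TargetInfo; only the three fields used in this example are modeled, and no CDP call is made.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class PageFilterSketch
{
    // Hypothetical stand-in for Target.TargetInfo, holding only the fields used here
    static final class Info
    {
        final String targetId, type, url;

        Info(String targetId, String type, String url)
        { this.targetId = targetId; this.type = type; this.url = url; }
    }

    // Accepts only targets whose type is "page" (i.e. browser tabs)
    static final Predicate<Info> IS_PAGE = (Info t) -> "page".equals(t.type);

    // Collects the IDs of every target accepted by the filter, in order
    static List<String> idsToClose(List<Info> all, Predicate<Info> filter)
    {
        List<String> ids = new ArrayList<>();
        for (Info t : all) if (filter.test(t)) ids.add(t.targetId);
        return ids;
    }
}
```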
        Throws:
        java.lang.Exception
        Code:
        Exact Method Body:
         Printing.notice("Closing All Currently Open Pages, using BrowserConn");
        
        
         // This is currently unused.  I used to filter for only the opened Wiki-Pages, but now this
         // method simply closes every open page.  No sense in deleting this line, though
        
         final Predicate<Target.TargetInfo> isSamAltman = (Target.TargetInfo t) ->
                 t.type.equals("page")
             &&  (t.url != null)
             &&  (t.url.startsWith(samAltmanURL));
        
         System.out.println
             ('\n' + BCYAN + "Example01.java: " + RESET + "Getting all tabs...");
            
         final Target.TargetInfo[] allTabs = Target
             .getTargets(null /* FilterEntry[] */)
             .exec(bws)
             .await();
        
         System.out.println
             ('\n' + BCYAN + "Example01.java: " + RESET + "Found " + allTabs.length + " tabs.");
        
         if (allTabs.length > 0)
        
             for (int i = 0; i < allTabs.length; i++)
             {
                 final String tid = allTabs[i].targetId;
                 System.out.println(BRED + "Closing Tab: " + RESET + tid);
                 Target.closeTarget(tid).exec(bws).await();
             }
        
      • STEP_03_openSamAltmanPage

        protected static java.lang.String STEP_03_openSamAltmanPage​
                    (WebSocketSender bws)
                throws java.lang.Exception
        
        This step creates a new browser tab (a new "target") by invoking Target.createTarget with a specific URL, in this case the Sam Altman Wikipedia page. This uses the WebSocketSender connection previously established to send a CDP request and parse the result.

        The return value is a Target.TargetID, as a java.lang.String object containing the tab identifier, which will be used in the next step to get its associated WebSocket.

        This step illustrates how CDP allows opening URLs without user interaction, one of the key features that powers headless automation.
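
        On the wire, a CDP command is a JSON object with an id, a method name and a params object, and the reply carries the same id together with a result object. The following sketch builds the Target.createTarget request by hand and reads the targetId from a reply. It is an illustration only, not the code this library generates; the class name is hypothetical, and the URL is assumed to contain no characters that need JSON escaping.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CreateTargetSketch
{
    private static final Pattern TARGET_ID =
        Pattern.compile("\"targetId\"\\s*:\\s*\"([^\"]+)\"");

    // Builds the JSON text of a Target.createTarget command
    // (assumes 'url' contains no quotes or backslashes)
    static String request(int id, String url)
    {
        return "{\"id\":" + id + ",\"method\":\"Target.createTarget\","
            + "\"params\":{\"url\":\"" + url + "\"}}";
    }

    // Reads the new tab's identifier from the JSON text of the reply
    static String targetIdFrom(String response)
    {
        Matcher m = TARGET_ID.matcher(response);
        if (!m.find()) throw new IllegalArgumentException("No targetId in response");
        return m.group(1);
    }
}
```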
        Throws:
        java.lang.Exception
        Code:
        Exact Method Body:
         Printing.notice("Opening a Sam Altman Wikipedia Page, using BrowserConn.");
        
         final String targetID = Target
             .createTarget()
             .accept("url", samAltmanURL)
             .build()
             .exec(bws)
             .await();
        
         final Target.TargetInfo targetInfo = Target
             .getTargetInfo(targetID)
             .exec(bws)
             .await();
        
         System.out.println(
             '\n' + BCYAN + "Example01.java: " + RESET +
             BRED + "Created New Tab:\n" + RESET + targetInfo.toString()
         );
        
        
         // I leave these one second delays here.  AGAIN - Chat-GPT suggested them to me once.
         // Chat-GPT, in every sense of the word, knows more about my code than I do!  (The CDP 
         // Protocol is a very well understood protocol - just not in Java so much)
        
         Thread.sleep(1000);
        
         return targetID;
        
      • STEP_04_getPageWebSocket

        protected static WebSocketSender STEP_04_getPageWebSocket​
                    (java.lang.String targetID)
                throws java.lang.Exception
        
        Once a tab is opened with a known targetId, this step retrieves the specific WebSocket endpoint associated with that tab. CDP uses one WebSocket per tab, and this is necessary for interacting with page-level domains such as Page, Runtime, or DOM.

        The method uses the /json/list HTTP-Endpoint to get metadata for all tabs and filters by targetId to find the matching webSocketDebuggerUrl. Then, it opens a new WebSocketSender for that tab.

        From this point forward, CDP messages targeting the loaded page must use this tab-specific WebSocket.
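
        The lookup described above can be sketched without this library as follows. The code scans the JSON text returned by /json/list for the entry whose id matches and returns that entry's webSocketDebuggerUrl. It assumes each entry is a flat JSON object whose string values contain no braces or escaped quotes, and it uses regular expressions as a stand-in for a real JSON parser; the class and method names are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JsonListLookupSketch
{
    // One flat JSON object: everything between a '{' and the next '}'
    private static final Pattern OBJECT = Pattern.compile("\\{[^{}]*\\}");

    // Returns the string value of the named property, or null if it is absent
    private static String field(String object, String name)
    {
        Matcher m = Pattern
            .compile("\"" + Pattern.quote(name) + "\"\\s*:\\s*\"([^\"]*)\"")
            .matcher(object);

        return m.find() ? m.group(1) : null;
    }

    // Finds the tab entry with the given id and returns its WebSocket URL
    static Optional<String> webSocketUrlFor(String jsonList, String targetId)
    {
        Matcher m = OBJECT.matcher(jsonList);

        while (m.find())
        {
            String object = m.group();

            if (targetId.equals(field(object, "id")))
                return Optional.ofNullable(field(object, "webSocketDebuggerUrl"));
        }

        return Optional.empty();
    }
}
```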
        Throws:
        java.lang.Exception
        Code:
        Exact Method Body:
         Printing.notice("Create PageConn Web-Socket Connection to Altman's Wiki");
        
         // Attach to that Sam Altman Page (switch to tab-level WebSocket)
         final PageConn pageConn = PageConn
             .getAllPageConn(9222, false)
             .filter((PageConn pc) -> pc.id.equals(targetID))
             .findFirst()
             .orElseThrow(() -> new RuntimeException("The Page-Connection was Not found !!!"));
        
         System.out.println(
             '\n' + BCYAN + "Example01.java: " + RESET +
             BRED + "Found Page Connection to Sam Altman Wiki:\n" + RESET + pageConn.toString()
         );
        
         final WebSocketSender pws = pageConn.createSender(Example01.connRec);
        
        
         // I think this is the last one...  Wait 1 second, it might make a difference while the 
         // page actually loads, and the Web-Socket connects... I have no idea!  It's just 1 second!
        
         Thread.sleep(1000);
        
         return pws;
        
      • STEP_05_runJavaScript

        protected static java.lang.String STEP_05_runJavaScript​
                    (WebSocketSender pws)
                throws java.lang.Exception
        
        Before sending any JavaScript commands to the browser tab, certain CDP domains must be enabled. This method sends Page.enable(), DOM.enable() and RunTime.enable() commands to inform Chrome that you intend to receive events and execute script. It then evaluates document.documentElement.outerHTML via RunTime.evaluate() and returns the page's HTML as a String.

        Without this step, attempts to run JavaScript via RunTime.evaluate() may fail or be ignored. Enabling the domains registers your WebSocket session as a subscriber for those event types.

        Think of this as turning on the light switches: telling Chrome what features you intend to use during the session.
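
        For reference, the evaluate call corresponds to a CDP message whose method is Runtime.evaluate and whose params carry the expression and the returnByValue flag. The sketch below builds that message as text, including the escaping a JavaScript expression needs when it is embedded in a JSON string. The class and method names are hypothetical; the library's own builder produces its messages differently.

```java
class EvaluateMessageSketch
{
    // Wraps 's' in double quotes, escaping characters that may not appear raw in JSON
    static String jsonString(String s)
    {
        StringBuilder sb = new StringBuilder("\"");

        for (char c : s.toCharArray())
            switch (c)
            {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else          sb.append(c);
            }

        return sb.append('"').toString();
    }

    // Builds the JSON text of a Runtime.evaluate command with returnByValue set to true
    static String evaluate(int id, String expression)
    {
        return "{\"id\":" + id + ",\"method\":\"Runtime.evaluate\",\"params\":{"
            + "\"expression\":" + jsonString(expression) + ",\"returnByValue\":true}}";
    }
}
```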
        Throws:
        java.lang.Exception
        Code:
        Exact Method Body:
         Printing.notice("Execute the needed Java Script, so the Scraper can Run");
        
         // Enable the Page domain
         System.out.println('\n' + BCYAN + "Example01.java: " + RESET + "Page.enable()");
         Page.enable(null /* Boolean */).exec(pws).await();
        
         // Enable the DOM domain
         System.out.println('\n' + BCYAN + "Example01.java: " + RESET + "DOM.enable()");
         DOM.enable(null /* String */).exec(pws).await();
        
         // Enable the Runtime domain
         System.out.println('\n' + BCYAN + "Example01.java: " + RESET + "RunTime.enable()");
         RunTime.enable().exec(pws).await();
        
         // This is the actual last one.  Make sure that the DOM & RunTime modules are running!
         Thread.sleep(1000);
        
         // 5. Evaluate the HTML via JavaScript
         System.out.println('\n' + BCYAN + "Example01.java: " + RESET + "RunTime.evaluate()");
        
         final RunTime.evaluate$$RET r = RunTime
             .evaluate()
             .accept("expression", "document.documentElement.outerHTML")
             .accept("returnByValue", true)
             .build()
             .exec(pws)
             .await();
        
         System.out.println(
             '\n' + BCYAN + "Example01.java: " + RESET + "Response RemoteObject:" + '\n' +
             r.result.toString()
         );
        
         final String html = ((JsonString) r.result.value).getString();
        
         return html;
        
      • STEP_06_extractImageURLs

        protected static java.lang.String[] STEP_06_extractImageURLs​
                    (java.lang.String html)
                throws java.lang.Exception
        
        This method parses the HTML returned by the previous step and extracts the image URLs from it. It does not use CDP; the work is done in Java with HTMLPage.getPageTokens, TagNodeFind.all and Attributes.retrieve.

        The HTML is first tokenized into a Vector<HTMLNode>. TagNodeFind.all then locates every opening <img> tag, and Attributes.retrieve reads the src attribute of each one. The resulting URLs are printed to the terminal and returned as a String[].

        This is where the data pulled out of Chrome in the previous step is turned into something the rest of the Java program can use.
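
        As a rough picture of what extracting image URLs from an HTML string involves, here is a sketch that uses a regular expression instead of this library's parser. It is a simplified stand-in and not the technique used in the method body: it only handles quoted src attributes and will not cope with every form of HTML. The class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgSrcSketch
{
    // An <img ...> tag with a quoted src attribute; group 2 is the attribute value
    private static final Pattern IMG_SRC = Pattern.compile(
        "<img\\b[^>]*?\\ssrc\\s*=\\s*([\"'])(.*?)\\1",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL
    );

    // Returns the src value of every matching <img> tag, in document order
    static String[] extractImageURLs(String html)
    {
        List<String> urls = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);

        while (m.find()) urls.add(m.group(2));

        return urls.toArray(new String[0]);
    }
}
```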
        Throws:
        java.lang.Exception
        See Also:
        HTMLPage.getPageTokens(CharSequence, boolean), TagNodeFind, Attributes.retrieve(Vector, int[], String)
        Code:
        Exact Method Body:
         Printing.notice("Parsing HTML for Images Printing the URL's");
        
         final Vector<HTMLNode>      altPage = HTMLPage.getPageTokens(html, false);
         final int[]                 images  = TagNodeFind.all(altPage, TC.OpeningTags, "img");
         final String[]              imgURLs = Attributes.retrieve(altPage, images, "src");
         final int                   numImg  = imgURLs.length;
        
         System.out.println
             ('\n' + BCYAN + "Example01.java: " + RESET + "Number of Images Found: " + numImg);
        
         for (int i = 0; i < numImg; i++) System.out.println("    " + imgURLs[i]);
        
         return imgURLs;
        
      • STEP_07_downloadImages

        protected static void STEP_07_downloadImages​(java.lang.String[] imageURLs)
                                              throws java.lang.Exception
        The final step downloads each image URL retrieved in the previous step and saves the results to disk. The filenames are derived from the tail end of the URL path, and all downloads are saved to a configurable local directory.

        This method doesn't involve CDP; it's just traditional HTTP file downloading using ImageScraper.download(). But it completes the use case: open a tab, run JS to scrape content, and persist the result.

        This step closes the automation loop: going from page navigation to content extraction and finally saving that content offline.

        Make sure that a directory named image-downloads/ exists as a sub-directory of the directory from which this method is invoked.
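
        The preparation around the download can be sketched in plain Java. The method body keeps only protocol-relative URLs (those beginning with "//") and prefixes them with "https:"; the sketch below does the same, shows one way to take a file name from the tail of a URL path, and creates the image-downloads/ directory if it is missing. The class and method names are hypothetical, and the file-name helper illustrates the idea only; it is not a description of how ImageScraper names its files.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class DownloadPrepSketch
{
    // Keeps only protocol-relative URLs and turns them into absolute https URLs
    static List<String> toAbsolute(String[] imageURLs)
    {
        List<String> out = new ArrayList<>();
        for (String u : imageURLs)
            if (u.startsWith("//")) out.add("https:" + u);
        return out;
    }

    // Returns the last segment of the URL's path, ignoring any query string
    static String fileNameOf(String url)
    {
        String path = URI.create(url).getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    // Creates image-downloads/ under the current working directory if it does not exist
    static Path ensureDownloadDir() throws IOException
    {
        return Files.createDirectories(Paths.get("image-downloads"));
    }
}
```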
        Throws:
        java.lang.Exception
        See Also:
        ImageScraper.download(Request, Appendable), Request, Results, ImageScraper.shutdownTOThreads()
        Code:
        Exact Method Body:
         Printing.notice("Download the Image's into a folder");
        
         final Stream.Builder<String> builder = Stream.builder();
        
         for (int i = 0; i < imageURLs.length; i++)
             if (imageURLs[i].startsWith("//"))
                 builder.accept("https:" + imageURLs[i]);
        
         // Build a Request-Object
         final List<String>  imgURLsList = builder.build().collect(Collectors.toList());
         final Request       req         = Request.buildFromStrIter(imgURLsList);
        
         // Add a few more Scraper-Configurations to the Request Object
         req.targetDirectory                     = "image-downloads/";
         req.useDefaultCounterForImageFileNames  = true;
         req.skipOnDownloadException             = true;
         req.verbosity                           = Verbosity.Normal;
        
         try 
             // Run the scraper, Send all Text-Output to 'System.out' (Ignore / Discard Results)
             { final Results results = ImageScraper.download(req, System.out); }
        
         catch (Exception e)
             { System.out.println(EXCC.toString(e)); }
        
         finally 
             // This needs to happen, or this entire program will hang / lock up the terminal
             { ImageScraper.shutdownTOThreads(); }