Class ImageScraper


  • public class ImageScraper
    extends java.lang.Object
    ImageScraper-Suite Class
    The ImageScraper Tool itself includes three 'Helper-Classes' that facilitate its operations. These three Helpers include: Request, Results and ImageInfo.

    Building a Request:
    Building an Image-Download Request instance should be easy, and there is an example of doing just that at the top of the documentation for class Request. Properly configuring the class to handle any / all possible errors or exceptions that might occur when downloading images from a web-server requires a little reading of the JavaDoc pages provided by these tools.

    The Request class includes several booleans for suppressing / skipping exceptions if they occur during the download loop / process iteration. If an exception is thrown and suppressed, it will simply be logged to the Results class.

    Once a Request Object has been built, simply pass that object-instance to the ImageScraper method download and a download-process will begin.
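The "suppress and log" behavior described above can be pictured with a small, self-contained sketch. It does not use the real Request or Results classes; the class name, the `skipOnException` flag and the `failures` array are hypothetical stand-ins chosen only to illustrate the idea that a suppressed exception is recorded at the index of the item that failed, while an unsuppressed one aborts the loop.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in. This is NOT the library's Request / Results API.
class SkipOnExceptionSketch
{
    // Runs every task. A failure is either recorded (suppressed) or rethrown.
    static Exception[] runAll(List<Runnable> tasks, boolean skipOnException)
    {
        Exception[] failures = new Exception[tasks.size()];

        for (int i = 0; i < tasks.size(); i++)
        {
            try
            {
                tasks.get(i).run();
            }
            catch (RuntimeException e)
            {
                if (!skipOnException) throw e;  // not suppressed: the loop stops here
                failures[i] = e;                // suppressed: remember it and move on
            }
        }

        return failures;
    }

    // Three simulated downloads; the second one fails.
    static Exception[] demo()
    {
        List<Runnable> tasks = new ArrayList<>();
        tasks.add(() -> { });
        tasks.add(() -> { throw new IllegalStateException("simulated download failure"); });
        tasks.add(() -> { });
        return runAll(tasks, true);
    }

    public static void main(String[] args)
    {
        Exception[] failures = demo();

        for (int i = 0; i < failures.length; i++)
            System.out.println(i + ": " + (failures[i] == null ? "ok" : failures[i].getMessage()));
    }
}
```

With the flag set to `false`, the same failure would propagate out of `runAll` instead of being stored.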

    Request's Lambda-Targets:
    If the Request object contains any Lambda-Targets / Function-Pointers, then those Lambda-Methods will be passed instances of the 'Helper-Class' ImageInfo when they are invoked by the download-loop. These Function-Pointers allow a programmer to filter-out certain Image-URL's and to decide where a downloaded Image is ultimately stored.

    Finally, when the download-loop has run to completion, it will return an instance of class Results.
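A rough illustration of the two roles such lambdas play (filtering URL's, and choosing a save location) is sketched below. It uses only `java.util.function` types; the nested `Info` class is a hypothetical stand-in for ImageInfo, and its `url` and `index` fields are assumptions made for this example rather than members documented on this page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

class LambdaTargetSketch
{
    // Hypothetical stand-in for the object handed to each lambda.
    // The real ImageInfo class may expose different members.
    static final class Info
    {
        final String url;
        final int index;

        Info(String url, int index) { this.url = url; this.index = index; }
    }

    // Applies a "keep this URL?" filter and a "where to save it?" function to each URL.
    static List<String> plan(List<String> urls, Predicate<Info> keep, Function<Info, String> target)
    {
        List<String> saveLocations = new ArrayList<>();

        for (int i = 0; i < urls.size(); i++)
        {
            Info info = new Info(urls.get(i), i);
            if (keep.test(info)) saveLocations.add(target.apply(info));
        }

        return saveLocations;
    }

    static List<String> demo()
    {
        List<String> urls = Arrays.asList(
            "https://example.com/a.png",
            "https://example.com/ad.gif",
            "https://example.com/b.png"
        );

        return plan(
            urls,
            info -> !info.url.endsWith(".gif"),         // filter-out GIF's
            info -> "images/img" + info.index + ".png"  // decide the file name
        );
    }

    public static void main(String[] args)
    {
        System.out.println(demo());   // prints [images/img0.png, images/img2.png]
    }
}
```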

    Getting Results:
    After the ImageScraper.download(...) loop has run to completion, an instance of class Results will be returned to the user. It contains several parallel-arrays that hold / store data about what transpired when trying to download each of the Image-URL's which were passed to the Request-Object.

    For instance the 'skipped' array will indicate which pictures didn't download. The 'fileNames' array will hold the name of the file of each image that was successfully downloaded. And the 'imageFormats' will identify which format was ultimately decided-upon when saving the image.

    Remember that each of these return-arrays is parallel to the others, and (of course) all will be identical in length. Furthermore, as per the definition of "Parallel-Arrays", the element residing at any index will always correspond to the same image in any one of the other arrays.
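Reading parallel-arrays amounts to walking a single index across all of them at once. The sketch below does this with plain arrays named after the 'skipped', 'fileNames' and 'imageFormats' arrays mentioned above. It is an illustration only: the arrays are built by hand here rather than taken from a real Results instance, and treating 'imageFormats' as a `String[]` is an assumption.

```java
class ParallelArraysSketch
{
    // Index i in every array refers to the same image.
    static String summarize(boolean[] skipped, String[] fileNames, String[] imageFormats)
    {
        StringBuilder sb = new StringBuilder();

        for (int i = 0; i < skipped.length; i++)
        {
            if (skipped[i])
                sb.append(i).append(": skipped\n");
            else
                sb.append(i).append(": ").append(fileNames[i])
                  .append(" (").append(imageFormats[i]).append(")\n");
        }

        return sb.toString();
    }

    // Hand-built sample data: the second image did not download.
    static String demo()
    {
        boolean[] skipped      = { false, true, false };
        String[]  fileNames    = { "img0.png", null, "img2.jpg" };
        String[]  imageFormats = { "png", null, "jpg" };

        return summarize(skipped, fileNames, imageFormats);
    }

    public static void main(String[] args)
    {
        System.out.print(demo());
    }
}
```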
    A more advanced class for both downloading and saving a list of images, using URL's.

    ImageScraper (previously called "ImageScrape2") allows more fine-grained control over how the images are saved and downloaded. Though this class seems extremely complicated, parameter-wise, these parameters ultimately allow many alternate versions of what to do with downloaded images, where to save them, and how to name them. It can even deal with "Base-64 String Encoded Images" (images which are encoded using the <IMG SRC="data:image/jpeg..."> SRC-Attribute) with ease.
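For readers unfamiliar with Base-64 encoded images, the sketch below shows what decoding such a SRC-Attribute involves, using only `java.util.Base64` from the JDK. It is not the code ImageScraper itself uses; the class and method names are hypothetical, and the sample string encodes just the eight signature bytes that begin every PNG file.

```java
import java.util.Base64;

class DataUrlSketch
{
    // Returns the media type found between "data:" and ";base64".
    static String mediaType(String src)
    {
        return header(src).substring(0, header(src).length() - ";base64".length());
    }

    // Returns the raw image bytes carried inside the data: URL.
    static byte[] decode(String src)
    {
        header(src);    // validates the prefix
        return Base64.getDecoder().decode(src.substring(src.indexOf(',') + 1));
    }

    // Extracts and validates the part between "data:" and the first comma.
    private static String header(String src)
    {
        int comma = src.indexOf(',');

        if (!src.startsWith("data:") || comma < 0)
            throw new IllegalArgumentException("not a data: URL");

        String header = src.substring("data:".length(), comma);

        if (!header.endsWith(";base64"))
            throw new IllegalArgumentException("not Base-64 encoded");

        return header;
    }

    public static void main(String[] args)
    {
        String src = "data:image/png;base64,iVBORw0KGgo=";

        System.out.println(mediaType(src));        // prints image/png
        System.out.println(decode(src).length);    // prints 8
    }
}
```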

    Prevent Hangs with TimeOut:
    Note that this class uses monitor threads to ensure that image-downloads do not exceed a certain wait time. You may modify this Maximum Wait-Time using the parameters in class Request. This class (class ImageScraper) is a thread-safe class. It does not use any global or static-global variables, except a Thread-Pool 'Executor'. The 'Executor' is locked in a Thread-Safe manner using a Semaphore-Lock from java.util.concurrent.locks.

    Spawns a Monitor Thread:
    If a program has used this class, that program might not exit immediately when it reaches the end of its code. Java's class 'Executors' builds a thread-pool, and a time-out thread. This time-out thread stays alive (but unused) most of the time.

    If you have used this class, make sure to call the following method before your program completes, or you may find it idly-waiting for up to 30-seconds before dying and relinquishing control back to your operating-system.
    // Call this before your program terminates! 
    // Otherwise your program may HANG-IDLE for up to 30 seconds when terminating,
    // before the JRE finally kills the monitor-thread.
    
    ImageScraper.shutdownTOThreads();
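The reason such a call is needed can be reproduced with the JDK alone. The sketch below is a hypothetical stand-in, not ImageScraper's internals: it runs work on a shared `ExecutorService`, bounds the wait with `Future.get(timeout, unit)`, and shuts the pool down in a `finally` block. Without the shutdown, the idle (non-daemon) worker threads of a cached pool would keep the JVM alive until they expire, which for `Executors.newCachedThreadPool()` is sixty seconds.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimeoutSketch
{
    // A shared static pool, similar in spirit to the 'Executor' described above.
    private static final ExecutorService POOL = Executors.newCachedThreadPool();

    // Simulates a download that takes 'workMillis', but waits at most 'timeoutMillis' for it.
    static String runWithTimeout(long workMillis, long timeoutMillis)
    {
        Future<String> future = POOL.submit(() -> { Thread.sleep(workMillis); return "done"; });

        try
        {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        }
        catch (TimeoutException e)
        {
            future.cancel(true);    // interrupt the worker that is still running
            return "timed out";
        }
        catch (InterruptedException | ExecutionException e)
        {
            throw new IllegalStateException(e);
        }
    }

    // Plays the role that shutdownTOThreads() plays for ImageScraper.
    static void shutdown()
    {
        POOL.shutdownNow();
    }

    public static void main(String[] args)
    {
        try
        {
            System.out.println(runWithTimeout(10, 2000));    // finishes well within the limit
            System.out.println(runWithTimeout(5000, 50));    // exceeds the limit
        }
        finally
        {
            shutdown();    // without this, the JVM lingers until the idle threads expire
        }
    }
}
```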
    



    Stateless Class:
    This class neither contains any program-state, nor can it be instantiated. The @StaticFunctional Annotation may also be called 'The Spaghetti Report'. Static-Functional classes are, essentially, C-Styled Files, without any constructors or non-static member fields. It is a concept very similar to the Java-Bean's @Stateless Annotation.

    • 1 Constructor(s), 1 declared private, zero-argument constructor
    • 14 Method(s), 14 declared static
    • 3 Field(s), 3 declared static, 3 declared final


    • Method Summary

       
      Download & Save a List of Images, using a Request instance
      Modifier and Type Method
      static Results download​(Request request, Appendable log)
       
      Download all Images on a single HTML-Page into a Directory
      Modifier and Type Method
      static Ret2<int[],​Results> localizeImages​(Vector<HTMLNode> page, URL pageURL, Appendable log, String targetDirectory)
       
      Finalize class ImageScraper
      Modifier and Type Method
      static void shutdownTOThreads()
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Method Detail

      • shutdownTOThreads

        public static void shutdownTOThreads()
        If this class has been used to make "multi-threaded" calls that use a Time-Out wait-period, you might see your Java-Program hang for a few seconds when you would expect it to exit back to your O.S. normally.

        Before Exiting:
        When a program you have written reaches the end of its code, if you have performed any time-dependent Image-Downloads using this class (class ImageScraper), then your program might not exit immediately, but rather sit at the command-prompt for anywhere between 10 and 30 seconds before this Timeout-Thread dies.

        Note that you may use this method to immediately terminate any additional threads that were started.
      • localizeImages

        public static Ret2<int[], Results> localizeImages
                    (java.util.Vector<HTMLNode> page,
                     java.net.URL pageURL,
                     java.lang.Appendable log,
                     java.lang.String targetDirectory)
                throws java.io.IOException,
                       ImageScraperException
        
        Downloads images located inside an HTML Page and updates the SRC=... URL's so that the links point to local copies of the images.

        After completion of this method, an HTML page which contained any HTML image elements will have had those images downloaded to the local file-system, and also have had the HTML attribute 'src=...' changed to reflect the local image name instead of the Internet URL name.
        Parameters:
        page - Any vectorized-html page or subpage. This page should have HTML <IMG ...> elements in it, or else this method will exit without doing anything.
        pageURL - If any of the HTML image elements have src='...' attributes that are partially resolved or relative URL's, then this parameter is used (it is passed along when the Request is built) to convert those partial or relative URL's into complete URL's. The Image Downloader simply cannot work with partially resolved URL's, and will skip them. This parameter may be null, but if it is and there are incomplete-URL's, those images will simply not be downloaded.
        log - This is the 'logger' for this method. It may be null, and if it is - no output will be sent to the terminal. This expects an implementation of Java's java.lang.Appendable interface which allows for a wide range of options when logging intermediate messages.
        Class or Interface Instance              Use & Purpose
        'System.out'                             Sends text to the standard-out terminal
        Torello.Java.StorageWriter               Sends text to System.out, and saves it, internally.
        FileWriter, PrintWriter, StringWriter    General purpose java text-output classes
        FileOutputStream, PrintStream            More general-purpose java text-output classes

        Checked IOException:
        The Appendable interface requires that the Checked-Exception IOException be caught when using its append(...) methods.
        targetDirectory - The File-System directory where these files shall be stored.
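The role of the 'pageURL' parameter described above can be illustrated with `java.net.URI.resolve` from the JDK. This is a sketch of the general technique only, not the resolution code used by this library; the class name and the `String`-based signature are assumptions made to keep the example small (the actual parameter is a `java.net.URL`). A `null` page URL combined with a relative 'src' yields `null` here, mirroring the documented behavior that such images are simply not downloaded.

```java
import java.net.URI;

class ResolveSketch
{
    // Returns a complete URL for 'src', or null when it cannot be completed.
    static String resolve(String pageUrl, String src)
    {
        URI srcUri = URI.create(src);

        if (srcUri.isAbsolute()) return srcUri.toString();   // already complete
        if (pageUrl == null)     return null;                // nothing to resolve against

        return URI.create(pageUrl).resolve(srcUri).toString();
    }

    public static void main(String[] args)
    {
        String page = "https://example.com/gallery/index.html";

        System.out.println(resolve(page, "img/cat.png"));       // relative to the page's directory
        System.out.println(resolve(page, "/static/logo.png"));  // relative to the site root
        System.out.println(resolve(null, "img/cat.png"));       // prints null
    }
}
```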
        Returns:
        An instance of Ret2<int[], Results>. The two returned elements of this class include:

        • Ret2.a (int[])

          This shall contain an index-array for the indices of each HTML '<IMG SRC=...>' element found on the page. It is not guaranteed that each of these images will have been resolved or downloaded successfully, but rather just that each index points to an HTML 'IMG' element that had a 'SRC' attribute. The second element of this return-type will contain information regarding which images downloaded successfully.

        • Ret2.b (Results)

          The second element of the return-type shall be the instance of Results returned from the invocation of ImageScraper.download(...). This instance will provide details about each of the images that were downloaded; or, if the download failed, the reasons for the failure. This return element shall be null if no images were found on the page.

        These return Object references are not necessarily important - and they may be discarded if needed. They are provided as a matter of utility if further verification or research into successful downloads is needed.
        Throws:
        java.io.IOException - I/O Problems that weren't avoided.
        ImageScraperException - Thrown for any number of errors that went unsuppressed.
        Code:
        Exact Method Body:
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
         // Find all of the Image TagNode's on the Input Web-Page
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
        
         int[]           imgPosArr   = TagNodeFind.all(page, TC.Both, "img");
         Vector<TagNode> vec         = new Vector<>();
        
         // No Images Found.
         if (imgPosArr.length == 0) return new Ret2<int[], Results>(imgPosArr, null);
        
         for (int pos : imgPosArr) vec.addElement((TagNode) page.elementAt(pos));
        
        
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
         // Build a Request and Download all of the Image's that were just found / identified
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
        
         Request request = Request.buildFromTagNodeIter(vec, pageURL, true);
         request.targetDirectory = targetDirectory;
        
         // NOTE: This is NOT FINISHED:
         // SET ALL OF THE "Skip On Exception" booleans to TRUE!!!
        
         // Invoke the Main Image Downloader
         Results r = ImageScraper.download(request, log);
        
        
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
         // Replace the <IMG SRC=...> TagNode URL's for images that were successfully downloaded.
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
        
         // Now replace 
         ReplaceFunction replacer = (HTMLNode n, int arrPos, int count) ->
         {
             if (r.skipped[count] == false)
        
                 return ((TagNode) page.elementAt(arrPos))
                         .setAV("src", r.fileNames[count], SD.SingleQuotes);
        
             else return (TagNode) n;
         };
            
         ReplaceNodes.r(page, imgPosArr, replacer);
        
        
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
         // Report the Results
         // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
        
         return new Ret2<int[], Results>(imgPosArr, r);
        
      • download

        public static Results download​(Request request,
                                       java.lang.Appendable log)
                                throws ImageScraperException
        This will iterate through the URL's and download them. Note that parameter 'log' may be null, and if so, it will be quietly ignored.
        Parameters:
        request - This parameter takes customization requests for batch image downloads. To read more information about how to configure a download, please review the documentation for the class Request.

        Note that upon entering this method, this parameter is immediately cloned to prevent the possibility of Thread Concurrency Problems from happening. After cloning, the cloned instance is used exclusively, and the original parameter is discarded. Further changes to the parameter-instance will not have any effect on the process.
        log - This shall receive text / log information. This parameter may receive null, and if it does it will be ignored. When ignored, logging information will not be printed. This expects an implementation of Java's java.lang.Appendable interface which allows for a wide range of options when logging intermediate messages.
        Class or Interface Instance              Use & Purpose
        'System.out'                             Sends text to the standard-out terminal
        Torello.Java.StorageWriter               Sends text to System.out, and saves it, internally.
        FileWriter, PrintWriter, StringWriter    General purpose java text-output classes
        FileOutputStream, PrintStream            More general-purpose java text-output classes

        Checked IOException:
        The Appendable interface requires that the Checked-Exception IOException be caught when using its append(...) methods.
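How a nullable `java.lang.Appendable` log is typically handled can be sketched with JDK classes only. The helper below is hypothetical and is not this library's AppendableLog class: it ignores a `null` log, and it converts the checked `IOException` into the JDK's `UncheckedIOException` purely as a stand-in, whereas the documentation here states that this library throws an AppendableError in that situation.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

class LogSketch
{
    // Writes one line to 'log', or does nothing when 'log' is null.
    static void log(Appendable log, String message)
    {
        if (log == null) return;    // a null log is quietly ignored

        try
        {
            log.append(message).append('\n');
        }
        catch (IOException e)
        {
            // Appendable.append(...) declares the checked IOException, so it must be handled.
            throw new UncheckedIOException(e);
        }
    }

    static String demo()
    {
        StringBuilder sb = new StringBuilder();    // StringBuilder implements Appendable

        log(sb, "downloading image 0");
        log(null, "this line goes nowhere");
        log(sb, "done");

        return sb.toString();
    }

    public static void main(String[] args)
    {
        System.out.print(demo());
        log(System.out, "System.out is a PrintStream, which is also an Appendable");
    }
}
```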
        Returns:
        an instance of class Results for the download. The Results class contains several parallel arrays with information about images that have downloaded. If an image-download happens to fail due to an improperly formed URL (or an 'incorrect' URL), then the information in the Results arrays will contain a 'null' value for the index at those array-positions corresponding to the failed image.
        Throws:
        ImageScraperException - Thrown for any number of exceptions that may be thrown while executing the download-loop. If another exception is thrown, then it is wrapped by this class' exception (ImageScraperException), and set as the 'cause' of that exception.
        AppendableError - The interface java.lang.Appendable was designed to allow for an implementation to throw the (checked) exception IOException. This has many blessings, but can occasionally be a pain since, indeed, IOException is both a checked exception (and requires an explicit catch), and also very common (even ubiquitous) inside of HTTP download code.

        If the user-provided 'log' parameter throws an IOException for simply trying to write character-data to the log about the download-progress, then an AppendableError will be thrown. Note that this throwable does inherit java.lang.Error, meaning that it won't be caught by standard Java catch clauses (unless 'Error' is explicitly mentioned!)
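Both points made in this 'Throws' section (wrapping another exception as the 'cause', and an Error slipping past `catch (Exception e)`) can be checked with a small stand-alone sketch. The two nested classes are hypothetical stand-ins defined only for this example; they are not the library's ImageScraperException or AppendableError.

```java
import java.io.IOException;

class ThrowableSketch
{
    // Hypothetical stand-in for a checked exception that wraps its cause.
    static final class DemoScraperException extends Exception
    {
        DemoScraperException(String message, Throwable cause) { super(message, cause); }
    }

    // Hypothetical stand-in for a throwable that inherits java.lang.Error.
    static final class DemoAppendableError extends Error
    {
        DemoAppendableError(Throwable cause) { super(cause); }
    }

    // The original exception remains reachable through getCause().
    static String wrappedCause()
    {
        try
        {
            try
            {
                throw new IOException("connection reset");
            }
            catch (IOException e)
            {
                throw new DemoScraperException("download failed", e);
            }
        }
        catch (DemoScraperException e)
        {
            return e.getCause().getMessage();
        }
    }

    // An Error is not an Exception, so only the second catch clause matches.
    static String whoCatches()
    {
        try
        {
            throw new DemoAppendableError(new IOException("log is closed"));
        }
        catch (Exception e)
        {
            return "caught as Exception";
        }
        catch (Error e)
        {
            return "caught as Error";
        }
    }

    public static void main(String[] args)
    {
        System.out.println(wrappedCause());    // prints connection reset
        System.out.println(whoCatches());      // prints caught as Error
    }
}
```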
        Code:
        Exact Method Body:
         // Clone the Request, Similar to "SafeVarArgs" - Specifically, if the user starts playing
         // with the contents of this class in the middle of a download, it will not have any effect
         // on the 'request' object that is actually being used.
        
         request = request.clone();        
            
         // Runs a few tests to make sure there are no problems using the request
         request.CHECK();
        
         // Makes log printing easier and easier.
         AppendableLog al = new AppendableLog(log, request.verbosity);
        
         // Main Request-Configuration and Response Class Instances.
         Results results = new Results(request.size);
        
         // Private, Internal Static-Class.  Makes passing variables even easier
         RECORD r = new RECORD(request, results, al);
        
         // Now, this just gets rid of the surrounding try-catch block.  This is the only real
         // reason for the internal/private method 'downloadWithoutTryCatch'.  This makes the
         // indentation look a lot better.  Also, in this method, the 'log' is replaced with the
         // AppendableSafe log
        
         try 
         {
             mainDownloadLoop(r);
             return results;
         }
        
         catch (ImageScraperException e)
         {
             // If an exception causes the system to stop/halt, this extra '\n\n' makes the output
             // text look a little nicer (sometimes... Sometimes it already looks fine).
             // No more no less.
        
             if (al.hasLog) al.append("\n\nThrowing ImageScraperException...\n");
             throw e;
         }
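The effect of the `request = request.clone()` line at the top of the method body above can be demonstrated in isolation. The sketch below uses a hypothetical `Config` class in place of Request, and the 'interference' callback simulates a caller modifying the original object while processing is under way; because the method works on its own copy, the change is not observed.

```java
class DefensiveCopySketch
{
    // Hypothetical stand-in for a mutable configuration object such as Request.
    static final class Config implements Cloneable
    {
        String targetDirectory;

        Config(String targetDirectory) { this.targetDirectory = targetDirectory; }

        @Override
        public Config clone()
        {
            try
            {
                return (Config) super.clone();
            }
            catch (CloneNotSupportedException e)
            {
                throw new AssertionError(e);    // cannot happen: Config is Cloneable
            }
        }
    }

    // Clones on entry, then lets 'interference' modify the caller's original.
    static String process(Config config, Runnable interference)
    {
        config = config.clone();    // from here on, only the private copy is used
        interference.run();         // simulates a concurrent change to the original
        return config.targetDirectory;
    }

    static String demo()
    {
        Config original = new Config("images/");
        return process(original, () -> original.targetDirectory = "elsewhere/");
    }

    public static void main(String[] args)
    {
        System.out.println(demo());    // prints images/
    }
}
```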