Package Torello.HTML.Tools.Images
Class ImageScraper
- java.lang.Object
-
- Torello.HTML.Tools.Images.ImageScraper
-
public class ImageScraper extends java.lang.Object
ImageScraper-Suite Class
The ImageScraper Tool itself includes three 'Helper-Classes' that facilitate its operations. These three Helpers are: Request, Results and ImageInfo.
Building a Request:
Building an Image-Download Request instance should be extremely easy, and there is an example of doing just that at the top of the Request class. Properly configuring the class to handle any / all possible errors or exceptions that might occur when downloading images from a web-server requires a little reading of the JavaDoc pages provided by these tools.
The Request class includes several booleans for suppressing / skipping exceptions if they occur during the download loop / process iteration. If an exception is thrown and suppressed, it will simply be logged to the Results class.
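The suppress-and-log behavior described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class SkipOnExceptionSketch, the flag skipOnBadURL and the arrays skipped / causes are hypothetical stand-ins, not the actual fields of Request or Results. The real flag names are listed in the Request JavaDoc.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical stand-in: NOT the real Request / Results classes.
class SkipOnExceptionSketch
{
    // Parallel arrays, one slot per input string
    final boolean[]   skipped;
    final Exception[] causes;

    SkipOnExceptionSketch(int size)
    {
        skipped = new boolean[size];
        causes  = new Exception[size];
    }

    // When 'skipOnBadURL' is true, a malformed URL is recorded and the loop continues.
    // When it is false, the exception is wrapped, re-thrown, and the loop stops.
    static SkipOnExceptionSketch run(String[] urls, boolean skipOnBadURL)
    {
        SkipOnExceptionSketch res = new SkipOnExceptionSketch(urls.length);

        for (int i = 0; i < urls.length; i++)
        {
            try
            {
                new URL(urls[i]);   // stands in for the actual download step
            }
            catch (MalformedURLException e)
            {
                if (! skipOnBadURL) throw new IllegalStateException("URL #" + i, e);

                res.skipped[i] = true;
                res.causes[i]  = e;
            }
        }

        return res;
    }

    public static void main(String[] argv)
    {
        String[] urls = { "http://example.com/a.png", "not a url" };
        SkipOnExceptionSketch res = run(urls, true);

        for (int i = 0; i < res.skipped.length; i++)
            System.out.println(i + ": skipped=" + res.skipped[i]);
    }
}
```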
Once a Request Object has been built, simply pass that object-instance to the ImageScraper method download and a download-process will begin.
Request's Lambda-Targets:
If the Request object contains any Lambda-Targets / Function-Pointers, then those Lambda-Methods will be passed instances of the 'Helper-Class' ImageInfo when they are invoked by the download-loop. These Function-Pointers allow a programmer to do things like filter out certain Image-URL's, or decide where a downloaded Image is ultimately stored.
Finally, when the download-loop has run to completion, it will return an instance of class Results.
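The idea behind the Lambda-Targets can be sketched with plain java.util.function types. The nested class Info below is a hypothetical stand-in for ImageInfo, and the parameter names keep and targetDir are invented for this illustration. The actual field and lambda names are documented in the Request and ImageInfo classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

class LambdaTargetSketch
{
    // Hypothetical stand-in for the helper class ImageInfo
    static class Info
    {
        final String url;
        final int    index;

        Info(String url, int index) { this.url = url; this.index = index; }
    }

    // Applies a keep-filter and a directory-chooser to each URL, and returns the
    // planned save-paths for the images that were not filtered out.
    static List<String> plan
        (List<String> urls, Predicate<Info> keep, Function<Info, String> targetDir)
    {
        List<String> ret = new ArrayList<>();

        for (int i = 0; i < urls.size(); i++)
        {
            Info info = new Info(urls.get(i), i);

            if (! keep.test(info)) continue;    // filtered out: never downloaded

            ret.add(targetDir.apply(info) + "img" + i);
        }

        return ret;
    }

    public static void main(String[] argv)
    {
        List<String> urls =
            Arrays.asList("http://example.com/a.png", "http://example.com/b.gif");

        // Skip GIF's, and store everything else under "pics/"
        System.out.println(plan(urls, info -> ! info.url.endsWith(".gif"), info -> "pics/"));
    }
}
```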
Getting Results:
After the ImageScraper.download(...) loop has run to completion, an instance of class Results will be returned to the user. It contains several parallel-arrays that store data about what transpired when trying to download each of the Image-URL's which were passed to the Request-Object.
For instance the 'skipped' array will indicate which pictures didn't download. The 'fileNames' array will hold the name of the file of each image that was successfully downloaded. And the 'imageFormats' will identify which format was ultimately decided-upon when saving the image.
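Reading these arrays might look roughly like the sketch below. It uses plain arrays as stand-ins for the fields of Results: the names skipped, fileNames and imageFormats come from the text above, but their element types here (boolean, String, String) are assumptions made only for this illustration.

```java
class ParallelArraysSketch
{
    // Builds one report-line per image by reading the same index from every array
    static String[] report(boolean[] skipped, String[] fileNames, String[] imageFormats)
    {
        if ((skipped.length != fileNames.length) || (skipped.length != imageFormats.length))
            throw new IllegalArgumentException("parallel arrays must have equal length");

        String[] lines = new String[skipped.length];

        for (int i = 0; i < skipped.length; i++)
            lines[i] = skipped[i]
                ? ("#" + i + " skipped")
                : ("#" + i + " saved as " + fileNames[i] + " (" + imageFormats[i] + ")");

        return lines;
    }

    public static void main(String[] argv)
    {
        boolean[] skipped      = { false, true };
        String[]  fileNames    = { "a.png", null };
        String[]  imageFormats = { "png", null };

        for (String line : report(skipped, fileNames, imageFormats))
            System.out.println(line);
    }
}
```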
Remember that each of these return-arrays is parallel to the others, and (of course) they will all be identical in length. Furthermore, as per the definition of "Parallel-Arrays", the element residing at any index will always correspond to the same image in any one of the other arrays.
A more advanced class for both downloading and saving a list of images, using URL's.
ImageScraper (previously called "ImageScrape2") allows more fine-grained control over how the images are downloaded and saved. Though this class seems extremely complicated, parameter-wise, ultimately its parameters allow many alternate versions of what to do with downloaded images, where to save them, and how to name them. It can even deal with "Base-64 String Encoded Images" (images which are encoded using the <IMG SRC="data:image/jpeg..."> SRC-Attribute) with ease.
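For readers unfamiliar with Base-64 encoded images, the sketch below shows how such an SRC-Attribute value can be turned back into image bytes using only java.util.Base64. It is a minimal illustration, not the code used by this class (that lives in the helper ConvertB64Image); the class and method names here are hypothetical.

```java
import java.util.Base64;

class DataUrlSketch
{
    // Returns the decoded bytes of a "data:<mime-type>;base64,<payload>" SRC value,
    // or null if the string is not a Base-64 data-URL.
    static byte[] decode(String src)
    {
        if (! src.startsWith("data:")) return null;

        int comma = src.indexOf(',');
        if (comma < 0) return null;

        // Everything between "data:" and the comma, for example "image/png;base64"
        String header = src.substring(5, comma);
        if (! header.endsWith(";base64")) return null;

        return Base64.getDecoder().decode(src.substring(comma + 1));
    }

    public static void main(String[] argv)
    {
        // The payload below is only the 8-byte PNG file-signature, not a complete image
        byte[] bytes = decode("data:image/png;base64,iVBORw0KGgo=");

        System.out.println("decoded " + bytes.length + " bytes");
    }
}
```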
Prevent Hangs with TimeOut:
Note that this class uses monitor threads to ensure that image-downloads do not exceed a certain wait time. You may modify this Maximum Wait-Time using the parameters in class Request. This class (class ImageScraper) is a thread-safe class. It does not use any global or static-global variables, except a Thread-Pool 'Executor'. The 'Executor' is locked in a Thread-Safe manner using a Semaphore-Lock from java.util.concurrent.locks.
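The general technique of bounding a task's wait time can be sketched with the standard java.util.concurrent classes. This is not the implementation used by this class (see the helper DownloadImage for that); the class TimeoutSketch and the method callWithTimeout are hypothetical names used only for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimeoutSketch
{
    // Runs 'task' on 'pool', waiting at most 'maxWaitMillis'. Returns null on time-out.
    static <T> T callWithTimeout(ExecutorService pool, Callable<T> task, long maxWaitMillis)
        throws InterruptedException, ExecutionException
    {
        Future<T> future = pool.submit(task);

        try
        {
            return future.get(maxWaitMillis, TimeUnit.MILLISECONDS);
        }
        catch (TimeoutException e)
        {
            future.cancel(true);    // interrupts the worker; the task is abandoned
            return null;
        }
    }

    public static void main(String[] argv) throws Exception
    {
        ExecutorService pool = Executors.newCachedThreadPool();

        try
        {
            String fast = callWithTimeout(pool, () -> "done", 2000);
            String slow = callWithTimeout(pool, () -> { Thread.sleep(5000); return "late"; }, 100);

            System.out.println("fast=" + fast + ", slow=" + slow);
        }
        finally
        {
            pool.shutdownNow();
        }
    }
}
```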
Spawns a Monitor Thread:
When a program that has used this class is ready to exit, it might not exit immediately. Java's class 'Executors' builds a thread-pool and a time-out thread. This time-out thread stays alive (but unused) most of the time.
If you have used this class, make sure to call the following method before your program completes, or you may find it idly-waiting for up to 30-seconds before dying and relinquishing control back to your operating-system.
    // Call this before your program terminates!
    // Otherwise your program may HANG-IDLE for up to 30 seconds when terminating,
    // before the JRE finally kills the monitor-thread.
    ImageScraper.shutdownTOThreads();
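One way to make sure the shutdown call is never forgotten is a try / finally block around the program's work. The sketch below shows the pattern with a plain ExecutorService; POOL and shutdownPool() are hypothetical stand-ins for the library-owned thread-pool and for ImageScraper.shutdownTOThreads().

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ShutdownSketch
{
    // Stand-in for a library-owned, static thread-pool (hypothetical)
    static final ExecutorService POOL = Executors.newCachedThreadPool();

    // Stand-in for a call such as ImageScraper.shutdownTOThreads()
    static void shutdownPool()
    { POOL.shutdownNow(); }

    // Some work that uses the pool
    static int work() throws Exception
    { return POOL.submit(() -> 21 * 2).get(); }

    public static void main(String[] argv) throws Exception
    {
        try
        {
            System.out.println(work());
        }
        finally
        {
            // Without this call, idle (non-daemon) pool threads delay the exit of the JVM
            shutdownPool();
        }
    }
}
```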
Hi-Lited Source-Code:
This File's Source Code:
- Torello/HTML/Tools/Images/ImageScraper.java
File Size: 13,733 Bytes Line Count: 282 '\n' Characters Found
Static Helper Class:
- ../ImageScraper Helpers/BufferedImageToByteArray.java
File Size: 4,200 Bytes Line Count: 114 '\n' Characters Found
Static Helper Class:
- ../ImageScraper Helpers/ComputeFileName.java
File Size: 2,686 Bytes Line Count: 79 '\n' Characters Found
Static Helper Class:
- ../ImageScraper Helpers/ConvertB64Image.java
File Size: 2,978 Bytes Line Count: 83 '\n' Characters Found
Static Helper Class:
- ../ImageScraper Helpers/DownloadImage.java
File Size: 12,800 Bytes Line Count: 308 '\n' Characters Found
Static Helper Class:
- ../ImageScraper Helpers/MainLoopBody.java
File Size: 4,864 Bytes Line Count: 121 '\n' Characters Found
Static Helper Class:
- ../ImageScraper Helpers/RECORD.java
File Size: 19,303 Bytes Line Count: 493 '\n' Characters Found
Stateless Class:
This class neither contains any program-state, nor can it be instantiated. The @StaticFunctional Annotation may also be called 'The Spaghetti Report'. Static-Functional classes are, essentially, C-Styled Files, without any constructors or non-static member fields. It is a concept very similar to the Java-Bean's @Stateless Annotation.
- 1 Constructor(s), 1 declared private, zero-argument constructor
- 3 Method(s), 3 declared static
- 0 Field(s)
-
-
Method Summary
Download & Save a List of Images, using a Request instance
- Modifier and Type: static Results; Method: download(Request request, Appendable log)
Download all Images on a single HTML-Page into a Directory
- Modifier and Type: static Ret2<int[],Results>; Method: localizeImages(Vector<HTMLNode> page, URL pageURL, Appendable log, String targetDirectory)
Finalize class ImageScraper
- Modifier and Type: static void; Method: shutdownTOThreads()
-
-
-
Method Detail
-
shutdownTOThreads
public static void shutdownTOThreads()
If this class has been used to make "multi-threaded" calls that use a Time-Out wait-period, you might see your Java-Program hang for a few seconds when you would expect it to exit back to your O.S. normally.
Before Exiting:
When a program you have written reaches the end of its code, if you have performed any time-dependent Image-Downloads using this class (class ImageScraper), then your program might not exit immediately, but rather sit at the command-prompt for anywhere between 10 and 30 seconds before this Timeout-Thread dies.
Note that you may immediately terminate any additional threads that were started using this method.
- Code:
- Exact Method Body:
DownloadImage.executor.shutdownNow();
-
localizeImages
public static Ret2<int[],Results> localizeImages (java.util.Vector<HTMLNode> page, java.net.URL pageURL, java.lang.Appendable log, java.lang.String targetDirectory) throws java.io.IOException, ImageScraperException
Downloads images located inside an HTML Page and updates the SRC=... URL's so that the links point to local copies of the images.
After completion of this method, an HTML page which contained any HTML image elements will have had those images downloaded to the local file-system, and also have had the HTML attribute 'src=...' changed to reflect the local image name instead of the Internet URL name.
- Parameters:
page - Any vectorized-html page or subpage. This page should have HTML <IMG ...> elements in it, or else this method will exit without doing anything.
pageURL - If any of the HTML image elements have src='...' attributes that are partially resolved or relative URL's, then this URL is used to convert those partial or relative URL's into complete URL's. The Image Downloader simply cannot work with partially resolved URL's, and will skip them if they are partially resolved. This parameter may be null, but if it is and there are incomplete URL's, those images will simply not be downloaded.
log - This is the 'logger' for this method. It may be null, and if it is, no output will be sent to the terminal. This expects an implementation of Java's java.lang.Appendable interface, which allows for a wide range of options when logging intermediate messages.
Class or Interface Instance: Use & Purpose
- 'System.out': Sends text to the standard-out terminal
- Torello.Java.StorageWriter: Sends text to System.out, and saves it, internally.
- FileWriter, PrintWriter, StringWriter: General purpose java text-output classes
- FileOutputStream, PrintStream: More general-purpose java text-output classes
Checked IOException:
The Appendable interface requires that the Checked-Exception IOException be caught when using its append(...) methods.
targetDirectory - The File-System directory where these files shall be stored.
- Returns:
- An instance of Ret2<int[], Results>. The two returned elements of this class include:
Ret2.a (int[])
This shall contain an index-array holding the indices of each HTML '<IMG SRC=...>' element found on the page. It is not guaranteed that each of these images will have been resolved or downloaded successfully, but rather just that each index points to an HTML 'IMG' element that had a 'SRC' attribute. The second element of this return-type will contain information regarding which images downloaded successfully.
-
Ret2.b (Results)
The second element of the return-type shall be the instance of Results returned from the invocation of ImageScraper.download(...). This instance will provide details about each of the images that were downloaded; or, if the download failed, the reasons for the failure. This return element shall be null if no images were found on the page.
These returned Object references are not necessarily important, and they may be discarded if needed. They are provided as a matter of utility if further verification or research into successful downloads is needed.
- Throws:
java.io.IOException - I/O Problems that weren't avoided.
ImageScraperException - Thrown for any number of errors that went unsuppressed.
- Code:
- Exact Method Body:
    // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
    // Find all of the Image TagNode's on the Input Web-Page
    // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***

    final int[]             imgPosArr   = TagNodeFind.all(page, TC.Both, "img");
    final Vector<TagNode>   vec         = new Vector<>();

    // No Images Found.
    if (imgPosArr.length == 0) return new Ret2<int[], Results>(imgPosArr, null);

    for (final int pos : imgPosArr) vec.addElement((TagNode) page.elementAt(pos));


    // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
    // Build a Request and Download all of the Image's that were just found / identified
    // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***

    final Request request = Request.buildFromTagNodeIter(vec, pageURL, true);

    request.targetDirectory = targetDirectory;

    // NOTE: This is NOT FINISHED:
    // SET ALL OF THE "Skip On Exception" booleans to TRUE!!!

    // Invoke the Main Image Downloader
    final Results r = ImageScraper.download(request, log);


    // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
    // Replace the <IMG SRC=...> TagNode URL's for images that were successfully downloaded.
    // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***

    // Now replace
    final ReplaceFunction replacer = (HTMLNode n, int arrPos, int count) ->
    {
        if (r.skipped[count] == false)

            return ((TagNode) page.elementAt(arrPos))
                .setAV("src", r.fileNames[count], SD.SingleQuotes);

        else return (TagNode) n;
    };

    ReplaceNodes.r(page, imgPosArr, replacer);


    // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
    // Report the Results
    // *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***

    return new Ret2<int[], Results>(imgPosArr, r);
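The role of the pageURL parameter of localizeImages can be illustrated with java.net.URI. The sketch below is not the library's own resolution code, and the names ResolveSrcSketch and resolve are hypothetical. It only shows why a relative SRC value cannot be downloaded unless the URL of the page is known.

```java
import java.net.URI;

class ResolveSrcSketch
{
    // Resolves an <IMG SRC=...> value against the URL of the page.
    // Returns null when 'src' is relative and no page-URL is available
    // (such an image would have to be skipped).
    static String resolve(String pageURL, String src)
    {
        URI srcURI = URI.create(src);

        if (srcURI.isAbsolute()) return srcURI.toString();  // already complete
        if (pageURL == null)     return null;               // cannot be completed

        return URI.create(pageURL).resolve(srcURI).toString();
    }

    public static void main(String[] argv)
    {
        String page = "http://example.com/gallery/index.html";

        System.out.println(resolve(page, "pics/a.png"));
        System.out.println(resolve(page, "/img/b.gif"));
        System.out.println(resolve(null, "pics/a.png"));
    }
}
```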
-
download
public static Results download(Request request, java.lang.Appendable log) throws ImageScraperException
This will iterate through the URL's and download them. Note that parameter 'log' may be null, and if so, it will be quietly ignored.
- Parameters:
request - This parameter takes customization requests for batch image downloads. To read more information about how to configure a download, please review the documentation for the class Request.
Note that upon entering this method, this parameter is immediately cloned to prevent the possibility of Thread Concurrency Problems from happening. After cloning, the cloned instance is used exclusively, and the original parameter is discarded. Further changes to the parameter-instance will not have any effect on the process.
log - This shall receive text / log information. This parameter may receive null, and if it does it will be ignored. When ignored, logging information will not be printed. This expects an implementation of Java's java.lang.Appendable interface, which allows for a wide range of options when logging intermediate messages.
Class or Interface Instance: Use & Purpose
- 'System.out': Sends text to the standard-out terminal
- Torello.Java.StorageWriter: Sends text to System.out, and saves it, internally.
- FileWriter, PrintWriter, StringWriter: General purpose java text-output classes
- FileOutputStream, PrintStream: More general-purpose java text-output classes
Checked IOException:
The Appendable interface requires that the Checked-Exception IOException be caught when using its append(...) methods.
- Returns:
- an instance of class Results for the download. The Results class contains several parallel arrays with information about images that have downloaded. If an image-download happens to fail due to an improperly formed URL (or an 'incorrect' URL), then the information in the Results arrays will contain a 'null' value for the index at those array-positions corresponding to the failed image.
- Throws:
ImageScraperException - Thrown for any number of exceptions that may be thrown while executing the download-loop. If another exception is thrown, then it is wrapped by this class' exception (ImageScraperException), and set as the 'cause' of that exception.
AppendableError - The interface java.lang.Appendable was designed to allow for an implementation to throw the (checked) exception IOException. This has many blessings, but can occasionally be a pain since, indeed, IOException is both a checked exception (and requires an explicit catch), and also very common (even ubiquitous) inside of HTTP download code.
If the user-provided 'log' parameter throws an IOException for simply trying to write character-data to the log about the download-progress, then an AppendableError will be thrown. Note that this throwable does inherit java.lang.Error, meaning that it won't be caught by standard Java catch clauses (unless 'Error' is explicitly mentioned!)
- Code:
- Exact Method Body:
    // Clone the Request, Similar to "SafeVarArgs" - Specifically, if the user starts playing
    // with the contents of this class in the middle of a download, it will not have any effect
    // on the 'request' object that is actually being used.

    request = request.clone();

    // Runs a few tests to make sure there are no problems using the request
    request.CHECK();

    // Makes log printing easier and easier.
    final AppendableLog al = new AppendableLog(log, request.verbosity);

    // Main Request-Configuration and Response Class Instances.
    final Results results = new Results(request.size);

    // Private, Internal Static-Class. Makes passing variables even easier
    final RECORD r = new RECORD(request, results, al);

    // Now, this just gets rid of the surrounding try-catch block. This is the only real
    // reason for the internal/private method 'downloadWithoutTryCatch'. This makes the
    // indentation look a lot better. Also, in this method, the 'log' is replaced with the
    // AppendableSafe log

    try
    {
        // private static void mainDownloadLoop(RECORD r) throws ImageScraperException

        // Helps prepare for the printing loop;
        if (r.logLevelGTEQ1) r.append("\n");

        for (URL url : r.request.source())
        {
            r.reset();
            r.url = url;
            MainLoopBody.loop(r);
        }

        return results;
    }

    catch (ImageScraperException e)
    {
        // If an exception causes the system to stop/halt, this extra '\n\n' makes the output
        // text look a little nicer (sometimes... Sometimes it already looks fine).
        // No more no less.

        if (al.hasLog) al.append("\n\nThrowing ImageScraperException...\n");

        throw e;
    }
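The effect of cloning the request at the top of download can be sketched with a small stand-in class. Config below is hypothetical and far simpler than Request; only the field name targetDirectory is borrowed from the source above. The sketch shows that changes made to the caller's object after the clone do not reach the copy being used.

```java
class CloneOnEntrySketch
{
    // Hypothetical stand-in for a mutable configuration object such as Request
    static class Config implements Cloneable
    {
        String targetDirectory = "images/";

        @Override
        public Config clone()
        {
            try
                { return (Config) super.clone(); }
            catch (CloneNotSupportedException e)
                { throw new AssertionError(e); }    // cannot happen: Config is Cloneable
        }
    }

    // Clones the parameter first; later changes by the caller cannot affect this run
    static String run(Config cfg, Runnable callerInterference)
    {
        cfg = cfg.clone();

        callerInterference.run();   // simulates the caller mutating the original mid-run

        return cfg.targetDirectory;
    }

    public static void main(String[] argv)
    {
        final Config original = new Config();

        String used = run(original, () -> original.targetDirectory = "elsewhere/");

        System.out.println("used=" + used + ", original=" + original.targetDirectory);
    }
}
```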
-
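The AppendableError behavior described for download can be illustrated by wrapping the checked IOException of an Appendable inside a java.lang.Error. The class LogWriteError below is a hypothetical stand-in, not the actual AppendableError class, and the helper methods are invented for this sketch.

```java
import java.io.IOException;

class AppendableErrorSketch
{
    // Hypothetical stand-in for AppendableError: an Error wrapping the log's IOException
    static class LogWriteError extends Error
    {
        LogWriteError(IOException cause) { super("log threw IOException", cause); }
    }

    // Writes to the log without forcing callers to catch the checked IOException
    static void append(Appendable log, CharSequence text)
    {
        if (log == null) return;    // a null log is quietly ignored

        try
            { log.append(text); }
        catch (IOException e)
            { throw new LogWriteError(e); }
    }

    // An Appendable whose every method fails, used to trigger the error-path
    static Appendable failing()
    {
        return new Appendable()
        {
            public Appendable append(CharSequence cs) throws IOException
            { throw new IOException("disk full"); }

            public Appendable append(CharSequence cs, int start, int end) throws IOException
            { throw new IOException("disk full"); }

            public Appendable append(char c) throws IOException
            { throw new IOException("disk full"); }
        };
    }

    public static void main(String[] argv)
    {
        try
            { append(failing(), "downloading...\n"); }
        catch (Exception e)
            { System.out.println("never reached: an Error is not an Exception"); }
        catch (LogWriteError e)
            { System.out.println("caught Error, cause: " + e.getCause().getMessage()); }
    }
}
```

As the main method shows, a catch clause for Exception does not intercept the Error; it has to be named explicitly.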
-