Package Torello.HTML.Tools.Images
Class ImageScraper
- java.lang.Object
  - Torello.HTML.Tools.Images.ImageScraper
public class ImageScraper extends java.lang.Object
ImageScraper-Suite Class
The ImageScraper Tool itself includes three 'Helper-Classes' that facilitate its operations. These three Helpers are: Request, Results and ImageInfo.
Building a Request:
Building an Image-Download Request instance should be straightforward, and there is an example of doing just that at the top of the Request class. Properly configuring the class to handle the errors or exceptions that might occur when downloading images from a web-server requires a little reading of the JavaDoc pages provided by these tools.
The Request class includes several booleans for suppressing / skipping exceptions if they occur during the download loop. If an exception is thrown and suppressed, it will simply be logged to the Results class.
Once a Request object has been built, simply pass that object-instance to the ImageScraper method download and a download-process will begin.
Request's Lambda-Targets:
If the Request object contains any Lambda-Targets / Function-Pointers, then those Lambda-Methods will be passed instances of the 'Helper-Class' ImageInfo when they are invoked by the download-loop. These Function-Pointers allow a programmer to do things like filter out certain Image-URL's and decide where a downloaded Image is ultimately stored.
Finally, when the download-loop has run to completion, it will return an instance of class Results.
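The way a download-loop hands a per-image object to user-supplied Lambda-Targets can be pictured with the minimal sketch below. It does not use the library at all: 'ImageInfoStub', its fields, and the 'keep' / 'nameFor' parameters are hypothetical stand-ins chosen for illustration, and the real field names should be looked up in the Request and ImageInfo documentation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

final class LambdaTargetSketch
{
    // Hypothetical stand-in for the library's 'ImageInfo' helper (NOT its real fields)
    static final class ImageInfoStub
    {
        final String url;
        final int    index;

        ImageInfoStub(String url, int index)
        { this.url = url; this.index = index; }
    }

    // A pretend "download loop": each URL is offered to the filter, and every URL
    // that passes is handed to the naming function.  Nothing is actually downloaded.
    static List<String> plan(
            List<String>                    urls,
            Predicate<ImageInfoStub>        keep,
            Function<ImageInfoStub, String> nameFor
        )
    {
        List<String> savedAs = new ArrayList<>();

        for (int i = 0; i < urls.size(); i++)
        {
            ImageInfoStub info = new ImageInfoStub(urls.get(i), i);
            if (keep.test(info)) savedAs.add(nameFor.apply(info));
        }

        return savedAs;
    }

    public static void main(String[] args)
    {
        List<String> urls = new ArrayList<>();
        urls.add("https://example.com/a.jpg");
        urls.add("https://example.com/tracker.gif");
        urls.add("https://example.com/b.png");

        // Filter out GIF's, and store everything else as 'img/<index>-<file-name>'
        List<String> out = plan(
            urls,
            info -> ! info.url.endsWith(".gif"),
            info -> "img/" + info.index + "-" +
                info.url.substring(info.url.lastIndexOf('/') + 1)
        );

        System.out.println(out);    // prints: [img/0-a.jpg, img/2-b.png]
    }
}
```

The point of the design is that the loop owns the iteration, while the caller only supplies small decisions (keep or skip, where to save) as lambdas.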
Getting Results:
After the ImageScraper.download(...) loop has run to completion, an instance of class Results will be returned to the user. It contains several parallel-arrays that store data about what transpired when trying to download each of the Image-URL's which were passed to the Request object.
For instance, the 'skipped' array indicates which pictures didn't download. The 'fileNames' array holds the file name of each image that was successfully downloaded. And the 'imageFormats' array identifies which format was ultimately decided upon when saving the image.
Remember that these return-arrays are parallel to each other, and (of course) will be identical in length. Furthermore, as per the definition of "Parallel-Arrays", the element residing at any index will always correspond to the same image in any one of the other arrays.

A more advanced class for both downloading and saving a list of images, using URL's.
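Reading parallel-arrays such as the ones held by Results amounts to walking a single index across all of them. The sketch below uses plain local arrays as hypothetical stand-ins for 'skipped', 'fileNames' and 'imageFormats'. The element types shown here (boolean and String) are assumptions made for illustration and may differ from the actual Results fields.

```java
final class ParallelArraysSketch
{
    // Builds a one-line summary for the image at 'index'.  The three arrays are assumed
    // to be parallel: position 'index' describes the same image in each of them.
    static String describe
        (boolean[] skipped, String[] fileNames, String[] imageFormats, int index)
    {
        if (skipped[index]) return "image " + index + ": skipped";

        return "image " + index + ": saved as " + fileNames[index] +
            " (" + imageFormats[index] + ")";
    }

    public static void main(String[] args)
    {
        // Hypothetical data standing in for the arrays of a 'Results' instance
        boolean[] skipped      = { false, true, false };
        String[]  fileNames    = { "img0.jpg", null, "img2.png" };
        String[]  imageFormats = { "jpg", null, "png" };

        // One index, three arrays: prints one line per requested image
        for (int i = 0; i < skipped.length; i++)
            System.out.println(describe(skipped, fileNames, imageFormats, i));
    }
}
```

Checking 'skipped' first avoids touching the (possibly null) entries that belong to an image which was never saved.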
ImageScraper (previously called "ImageScrape2") allows more fine-grained control over how the images are downloaded and saved. Though this class seems extremely complicated, parameter-wise, ultimately these parameters allow many alternate versions of what to do with downloaded images, where to save them, and how to name them. It can even deal with "Base-64 String Encoded Images" (images which are encoded using the <IMG SRC="data:image/jpeg..."> SRC-Attribute) with ease.
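To make the "Base-64 String Encoded Image" case concrete, here is a minimal sketch of how such a SRC-Attribute value can be decoded using only the JDK class java.util.Base64. The class and method names ('DataUrlSketch', 'decode', 'format') are hypothetical, and this is not necessarily how ImageScraper handles these images internally.

```java
import java.util.Base64;

final class DataUrlSketch
{
    // Extracts and decodes the payload of a 'data:<mime-type>;base64,<payload>' URL.
    // Returns null when the text is not a Base-64 encoded data-URL.
    static byte[] decode(String src)
    {
        if (! src.startsWith("data:")) return null;

        int comma = src.indexOf(',');
        if (comma < 0) return null;

        String header = src.substring(5, comma);    // e.g. "image/png;base64"
        if (! header.endsWith(";base64")) return null;

        return Base64.getDecoder().decode(src.substring(comma + 1));
    }

    // Pulls the image sub-type ("png", "jpeg", ...) out of the same header
    static String format(String src)
    {
        int slash = src.indexOf('/');
        int semi  = src.indexOf(';');

        return ((slash > 0) && (semi > slash)) ? src.substring(slash + 1, semi) : null;
    }

    public static void main(String[] args)
    {
        // "iVBORw0KGgo=" is the Base-64 form of the 8-byte PNG file-signature
        String src = "data:image/png;base64,iVBORw0KGgo=";

        byte[] bytes = decode(src);

        System.out.println(format(src) + ", " + bytes.length + " bytes");
        // prints: png, 8 bytes
    }
}
```

The decoded bytes could then be written to disk exactly like bytes received over HTTP, which is why such images need no network request at all.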
Prevent Hangs with TimeOut:
Note that this class uses monitor threads to ensure that image-downloads do not exceed a certain wait time. You may modify this Maximum Wait-Time using the parameters in class Request. This class (class ImageScraper) is a thread-safe class. It does not use any global or static-global variables, except a Thread-Pool 'Executor'. The 'Executor' is locked in a Thread-Safe manner using a Semaphore-Lock from java.util.concurrent.locks.
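The general technique of capping the wait-time of a task can be sketched with the standard java.util.concurrent classes, as below. This is only an illustration of the idea under stated assumptions: 'TimeoutSketch' and 'runWithTimeout' are hypothetical names, and the actual monitor-thread code inside ImageScraper may be organized differently.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimeoutSketch
{
    // Runs 'task' on the pool, waiting at most 'maxWaitMillis'.  Returns null if the task
    // did not finish in time or failed (a real downloader would record the reason).
    static <T> T runWithTimeout(ExecutorService pool, Callable<T> task, long maxWaitMillis)
    {
        Future<T> future = pool.submit(task);

        try
            { return future.get(maxWaitMillis, TimeUnit.MILLISECONDS); }

        catch (TimeoutException e)
            { future.cancel(true); return null; }   // interrupts the stalled task

        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();     // preserve the interrupt flag
            future.cancel(true);
            return null;
        }

        catch (ExecutionException e)
            { return null; }                        // the task itself threw an exception
    }

    public static void main(String[] args)
    {
        ExecutorService pool = Executors.newCachedThreadPool();

        try
        {
            // A task that returns immediately, and one that sleeps past its deadline
            String fast = runWithTimeout(pool, () -> "done", 1000);
            String slow = runWithTimeout
                (pool, () -> { Thread.sleep(5000); return "late"; }, 100);

            System.out.println("fast=" + fast + ", slow=" + slow);
        }

        finally { pool.shutdownNow(); }
    }
}
```

Cancelling the Future after a time-out matters: without it, the stalled task would keep occupying a pool thread even though its result is no longer wanted.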
Spawns a Monitor Thread:
If a program has used this class, it might not exit immediately once it reaches the end of its code. Java's class 'Executors' builds a thread-pool, and a time-out thread. This time-out thread stays alive (but unused) most of the time.
If you have used this class, make sure to call the following method before your program completes, or you may find it idly-waiting for up to 30-seconds before dying and relinquishing control back to your operating-system.
// Call this before your program terminates!
// Otherwise your program may HANG-IDLE for up to 30 seconds when terminating,
// before the JRE finally kills the monitor-thread.
ImageScraper.shutdownTOThreads();
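Why an idle thread-pool can delay program exit, and what a shutdown call accomplishes, can be seen with plain JDK classes. In the sketch below 'ShutdownSketch', 'POOL', 'doWork' and 'shutdown' are all hypothetical names. The sketch only mirrors the general idea behind a method like shutdownTOThreads() and makes no claim about its actual body.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class ShutdownSketch
{
    // Hypothetical class-wide pool.  Its worker threads are non-daemon threads, so an
    // idle worker keeps the JVM alive until it times out or the pool is shut down.
    private static final ExecutorService POOL = Executors.newCachedThreadPool();

    // Pretends to fetch something using the shared pool
    static String doWork()
    {
        try
            { return POOL.submit(() -> "image bytes").get(); }

        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }

        catch (ExecutionException e)
            { throw new IllegalStateException(e.getCause()); }
    }

    // Stops the pool's threads right away.  Returns true if they have all terminated.
    static boolean shutdown()
    {
        POOL.shutdownNow();

        try
            { return POOL.awaitTermination(5, TimeUnit.SECONDS); }

        catch (InterruptedException e)
            { Thread.currentThread().interrupt(); return false; }
    }

    public static void main(String[] args)
    {
        // The 'finally' guarantees the pool is released even if the work fails
        try     { System.out.println(doWork()); }
        finally { System.out.println("pool terminated: " + shutdown()); }
    }
}
```

Placing the shutdown call in a 'finally' block at the end of 'main' is one simple way to make sure it runs on every exit path.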
Hi-Lited Source-Code: Torello/HTML/Tools/Images/ImageScraper.java
File Size: 61,501 Bytes, Line Count: 1,435 '\n' Characters Found
Stateless Class: This class neither contains any program-state, nor can it be instantiated. The @StaticFunctional Annotation may also be called 'The Spaghetti Report'. Static-Functional classes are, essentially, C-Styled Files, without any constructors or non-static member fields. It is a concept very similar to the Java-Bean's @Stateless Annotation.
- 1 Constructor(s), 1 declared private, zero-argument constructor
- 14 Method(s), 14 declared static
- 3 Field(s), 3 declared static, 3 declared final
Method Summary
Download & Save a List of Images, using a Request instance
Modifier and Type            Method
static Results               download(Request request, Appendable log)

Download all Images on a single HTML-Page into a Directory
Modifier and Type            Method
static Ret2<int[],Results>   localizeImages(Vector<HTMLNode> page, URL pageURL, Appendable log, String targetDirectory)

Finalize class ImageScraper
Modifier and Type            Method
static void                  shutdownTOThreads()
Method Detail
-
shutdownTOThreads
public static void shutdownTOThreads()
If this class has been used to make "multi-threaded" calls that use a Time-Out wait-period, you might see your Java-Program hang for a few seconds when you would expect it to exit back to your O.S. normally.
Before Exiting:
When a program you have written reaches the end of its code, if you have performed any time-dependent Image-Downloads using this class (class ImageScraper), then your program might not exit immediately, but rather sit at the command-prompt for anywhere between 10 and 30 seconds before this Timeout-Thread dies.
Note that you may immediately terminate any additional threads that were started using this method.
-
localizeImages
public static Ret2<int[],Results> localizeImages (java.util.Vector<HTMLNode> page, java.net.URL pageURL, java.lang.Appendable log, java.lang.String targetDirectory) throws java.io.IOException, ImageScraperException
Downloads images located inside an HTML Page and updates the SRC=... URL's so that the links point to a local copy of the images.
After completion of this method, an HTML page which contained any HTML image elements will have had those images downloaded to the local file-system, and will also have had the HTML attribute 'src=...' changed to reflect the local image name instead of the Internet URL name.
- Parameters:
page - Any vectorized-html page or subpage. This page should have HTML <IMG ...> elements in it, or else this method will exit without doing anything.
pageURL - If any of the HTML image elements have src='...' attributes that are partially resolved or relative URL's, then this parameter is used to convert those partial or relative URL's into complete URL's. The Image Downloader simply cannot work with partially resolved URL's, and will skip them. This parameter may be null, but if it is and there are incomplete URL's, those images will simply not be downloaded.
log - This is the 'logger' for this method. It may be null, and if it is, no output will be sent to the terminal. This expects an implementation of Java's java.lang.Appendable interface which allows for a wide range of options when logging intermediate messages.

Class or Interface Instance              Use & Purpose
'System.out'                             Sends text to the standard-out terminal
Torello.Java.StorageWriter               Sends text to System.out, and saves it, internally.
FileWriter, PrintWriter, StringWriter    General purpose java text-output classes
FileOutputStream, PrintStream            More general-purpose java text-output classes

Checked IOException:
The Appendable interface requires that the Checked-Exception IOException be caught when using its append(...) methods.
targetDirectory - The File-System directory where these files shall be stored.
- Returns:
- An instance of Ret2<int[], Results>. The two returned elements of this class include:
- Ret2.a (int[])
This shall contain an index-array for the indices of each HTML '<IMG SRC=...>' element found on the page. It is not guaranteed that each of these images will have been resolved or downloaded successfully, but rather just that an HTML 'IMG' element that had a 'SRC' attribute was found at that index. The second element of this return-type will contain information regarding which images downloaded successfully.
- Ret2.b (Results)
The second element of the return-type shall be the instance of Results returned from the invocation of ImageScraper.download(...). This instance will provide details about each of the images that were downloaded; or, if the download failed, the reasons for the failure. This return element shall be null if no images were found on the page.
These returned Object references are not necessarily important, and they may be discarded if not needed. They are provided as a matter of utility if further verification or research into successful downloads is needed.
- Throws:
java.io.IOException - I/O Problems that weren't avoided.
ImageScraperException - Thrown for any number of errors that went unsuppressed.
- Code:
- Exact Method Body:
// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
// Find all of the Image TagNode's on the Input Web-Page
// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***

int[] imgPosArr = TagNodeFind.all(page, TC.Both, "img");
Vector<TagNode> vec = new Vector<>();

// No Images Found.
if (imgPosArr.length == 0) return new Ret2<int[], Results>(imgPosArr, null);

for (int pos : imgPosArr) vec.addElement((TagNode) page.elementAt(pos));

// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
// Build a Request and Download all of the Image's that were just found / identified
// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***

Request request = Request.buildFromTagNodeIter(vec, pageURL, true);

request.targetDirectory = targetDirectory;

// NOTE: This is NOT FINISHED:
// SET ALL OF THE "Skip On Exception" booleans to TRUE!!!

// Invoke the Main Image Downloader
Results r = ImageScraper.download(request, log);

// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
// Replace the <IMG SRC=...> TagNode URL's for images that were successfully downloaded.
// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***

// Now replace
ReplaceFunction replacer = (HTMLNode n, int arrPos, int count) ->
{
    if (r.skipped[count] == false)

        return ((TagNode) page.elementAt(arrPos))
            .setAV("src", r.fileNames[count], SD.SingleQuotes);

    else return (TagNode) n;
};

ReplaceNodes.r(page, imgPosArr, replacer);

// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
// Report the Results
// *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***

return new Ret2<int[], Results>(imgPosArr, r);
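The role played by a page-URL when an image SRC is relative can be illustrated with the JDK class java.net.URI. This sketch is independent of the library: 'ResolveSketch' and 'resolve' are hypothetical names, and the library's own resolution logic may differ in its details.

```java
import java.net.URI;

final class ResolveSketch
{
    // Resolves an <IMG SRC=...> value against the address of the page it came from.
    // Returns null when 'src' is relative and no page address is available, which is
    // the situation in which an image would have to be skipped.
    static String resolve(String pageURL, String src)
    {
        URI srcURI = URI.create(src);

        if (srcURI.isAbsolute()) return srcURI.toString();  // already complete
        if (pageURL == null)     return null;               // cannot be completed

        return URI.create(pageURL).resolve(srcURI).toString();
    }

    public static void main(String[] args)
    {
        String page = "https://example.com/news/story.html";

        System.out.println(resolve(page, "pics/a.jpg"));
        // prints: https://example.com/news/pics/a.jpg

        System.out.println(resolve(page, "/logo.png"));
        // prints: https://example.com/logo.png

        System.out.println(resolve(null, "pics/a.jpg"));
        // prints: null
    }
}
```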
-
download
public static Results download(Request request, java.lang.Appendable log) throws ImageScraperException
This will iterate through the URL's and download them. Note that parameter 'log' may be null, and if so, it will be quietly ignored.
- Parameters:
request - This parameter takes customization requests for batch image downloads. To read more information about how to configure a download, please review the documentation for the class Request.
Note that upon entering this method, this parameter is immediately cloned to prevent the possibility of Thread Concurrency Problems from happening. After cloning, the cloned instance is used exclusively, and the original parameter is discarded. Further changes to the parameter-instance will not have any effect on the process.
log - This shall receive text / log information. This parameter may receive null, and if it does it will be ignored. When ignored, logging information will not be printed. This expects an implementation of Java's java.lang.Appendable interface which allows for a wide range of options when logging intermediate messages.

Class or Interface Instance              Use & Purpose
'System.out'                             Sends text to the standard-out terminal
Torello.Java.StorageWriter               Sends text to System.out, and saves it, internally.
FileWriter, PrintWriter, StringWriter    General purpose java text-output classes
FileOutputStream, PrintStream            More general-purpose java text-output classes

Checked IOException:
The Appendable interface requires that the Checked-Exception IOException be caught when using its append(...) methods.
- Returns:
- An instance of class Results for the download. The Results class contains several parallel arrays with information about images that have downloaded. If an image-download happens to fail due to an improperly formed URL (or an 'incorrect' URL), then the Results arrays will contain a 'null' value at the array-positions corresponding to the failed image.
- Throws:
ImageScraperException - Thrown for any number of exceptions that may be thrown while executing the download-loop. If another exception is thrown, then it is wrapped by this class' exception (ImageScraperException), and set as the 'cause' of that exception.
AppendableError - The interface java.lang.Appendable was designed to allow an implementation to throw the (checked) exception IOException. This has many blessings, but can occasionally be a pain since IOException is both a checked exception (and requires an explicit catch), and also very common (even ubiquitous) inside of HTTP download code.
If the user-provided 'log' parameter throws an IOException while simply trying to write character-data to the log about the download-progress, then an AppendableError will be thrown. Note that this throwable inherits java.lang.Error, meaning that it won't be caught by a standard Java catch (Exception e) clause (unless 'Error' or 'Throwable' is explicitly mentioned!)
- Code:
- Exact Method Body:
// Clone the Request, Similar to "SafeVarArgs" - Specifically, if the user starts playing
// with the contents of this class in the middle of a download, it will not have any effect
// on the 'request' object that is actually being used.

request = request.clone();

// Runs a few tests to make sure there are no problems using the request
request.CHECK();

// Makes log printing easier and easier.
AppendableLog al = new AppendableLog(log, request.verbosity);

// Main Request-Configuration and Response Class Instances.
Results results = new Results(request.size);

// Private, Internal Static-Class. Makes passing variables even easier
RECORD r = new RECORD(request, results, al);

// Now, this just gets rid of the surrounding try-catch block. This is the only real
// reason for the internal/private method 'downloadWithoutTryCatch'. This makes the
// indentation look a lot better. Also, in this method, the 'log' is replaced with the
// AppendableSafe log

try
    { mainDownloadLoop(r); return results; }

catch (ImageScraperException e)
{
    // If an exception causes the system to stop/halt, this extra '\n\n' makes the output
    // text look a little nicer (sometimes... Sometimes it already looks fine).
    // No more no less.

    if (al.hasLog) al.append("\n\nThrowing ImageScraperException...\n");

    throw e;
}
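The consequence of converting a log's IOException into a java.lang.Error can be demonstrated with a small stand-alone sketch. 'LogFailureError' below is a hypothetical stand-in for AppendableError, and 'write', 'BROKEN' and 'whoCatches' are likewise names invented for this illustration. None of it is the library's actual code.

```java
import java.io.IOException;

final class AppendableErrorSketch
{
    // Hypothetical stand-in for the library's 'AppendableError'
    static class LogFailureError extends Error
    {
        LogFailureError(IOException cause)
        { super("the log could not be written", cause); }
    }

    // Writes to the log, converting the checked IOException into an unchecked Error
    static void write(Appendable log, String msg)
    {
        if (log == null) return;    // a null log is quietly ignored

        try
            { log.append(msg); }

        catch (IOException e)
            { throw new LogFailureError(e); }
    }

    // An Appendable whose every 'append' fails
    static final Appendable BROKEN = new Appendable()
    {
        public Appendable append(CharSequence cs) throws IOException
        { throw new IOException("disk full"); }

        public Appendable append(CharSequence cs, int start, int end) throws IOException
        { throw new IOException("disk full"); }

        public Appendable append(char c) throws IOException
        { throw new IOException("disk full"); }
    };

    // Reports which catch clause, if any, receives the failure
    static String whoCatches(Appendable log)
    {
        try
        {
            try
                { write(log, "downloading...\n"); return "nothing thrown"; }

            catch (Exception e)
                { return "catch (Exception)"; }     // never reached for an Error
        }

        catch (Error e)
            { return "catch (Error)"; }
    }

    public static void main(String[] args)
    {
        System.out.println(whoCatches(new StringBuilder()));    // prints: nothing thrown
        System.out.println(whoCatches(BROKEN));                 // prints: catch (Error)
    }
}
```

Because Error and Exception are separate branches under Throwable, a caller that wants to survive a failing log has to name Error (or Throwable) in its catch clause.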