Package Torello.HTML.Tools.Images
Class Request
- java.lang.Object
  - Torello.HTML.Tools.Images.Request
- All Implemented Interfaces:
  java.io.Serializable, java.lang.Cloneable
public class Request extends java.lang.Object implements java.lang.Cloneable, java.io.Serializable
ImageScraper-Suite Class
The ImageScraper Tool itself includes three 'Helper-Classes' that facilitate its operations: Request, Results and ImageInfo.
Building a Request:
Building an Image-Download Request instance should be easy, and there is an example of doing just that at the top of the Request class. Properly configuring the class to handle the errors or exceptions that might occur when downloading images from a web-server requires a little reading of the JavaDoc pages provided by these tools.
The Request class includes several boolean's for suppressing / skipping exceptions if they occur during the download loop. If an exception is thrown and suppressed, it will simply be logged to the Results class.
Once a Request object has been built, simply pass that object-instance to the ImageScraper method download and a download-process will begin.
Request's Lambda-Targets:
If the Request object contains any Lambda-Targets / Function-Pointers, then those Lambda-Methods will be passed instances of the 'Helper-Class' ImageInfo when they are invoked by the download-loop. These Function-Pointers allow a programmer to filter out certain Image-URL's and to decide where a downloaded Image is ultimately stored.
Finally, when the download-loop has run to completion, it will return an instance of class Results.
Getting Results:
After the ImageScraper.download(...) loop has run to completion, an instance of class Results will be returned to the user. It contains several parallel-arrays that hold data about what transpired when trying to download each of the Image-URL's which were passed to the Request-Object.
For instance, the 'skipped' array will indicate which pictures didn't download. The 'fileNames' array will hold the name of the file of each image that was successfully downloaded. And the 'imageFormats' array will identify which format was ultimately decided upon when saving the image.
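A minimal sketch of how parallel arrays of this kind can be read. The array names 'skipped', 'fileNames' and 'imageFormats' come from the description above; their element types (boolean, String), the class name and the sample values are assumptions made for illustration, not the actual declarations in class Results.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone sketch; it does not use the real Results class.
class ParallelArraysSketch
{
    // Builds one summary line per image by reading the same index in every array.
    static List<String> summarize(boolean[] skipped, String[] fileNames, String[] imageFormats)
    {
        if (skipped.length != fileNames.length || skipped.length != imageFormats.length)
            throw new IllegalArgumentException("parallel arrays must have identical lengths");

        List<String> lines = new ArrayList<>();

        for (int i = 0; i < skipped.length; i++)
            lines.add(skipped[i]
                ? ("image " + i + ": skipped")
                : ("image " + i + ": saved as " + fileNames[i] + " (" + imageFormats[i] + ")"));

        return lines;
    }

    public static void main(String[] args)
    {
        // Hypothetical values; index 1 stands for an image that was not downloaded.
        boolean[] skipped      = { false, true, false };
        String[]  fileNames    = { "00001.jpg", null, "00003.png" };
        String[]  imageFormats = { "jpg", null, "png" };

        summarize(skipped, fileNames, imageFormats).forEach(System.out::println);
    }
}
```

Because index i refers to the same image in every array, a single loop counter is enough to correlate all of the recorded facts about that image.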
Remember that each of these return-arrays is parallel to the others, and (of course) all are identical in length. Furthermore, as per the definition of "Parallel-Arrays", the element residing at any index will always correspond to the same image in any one of the other arrays.
Holds all relevant configurations and parameters needed to run the primary download-loop of class ImageScraper.
The class holds numerous configuration parameters that are provided to the ImageScraper when initiating an image-download. The available configurations include:
- Several Exception-Suppression 'skipOn' boolean's that allow a user to instruct the ImageScraper's download-loop not to throw exceptions, but rather suppress them and save them to the Results instance.
- Two Time-Out Fields that may be configured, which instruct the downloader on the maximum wait time that an image should be allowed before halting the process. When this time-limit is reached, the loop will either throw an exception, or suppress the exception and move on to the next image, depending upon the value assigned to 'skipOnTimeOutException'.
- Optionally configuring an HTTP 'User-Agent' when downloading images. The 'User-Agent' is an HTTP request header that allows a Web-Browser to identify itself when attempting to communicate with a server.
- Several configurations that allow a user to specify where and how an image is saved to disk.
- Finally, two Java Predicate's (Lambda-Targets) that allow a user to specify whether an image is saved or even downloaded. The Predicate 'skipURL' is a way to inform the main download-loop that an image should not even be downloaded in the first place. The Predicate 'keeperPredicate' allows a user to specify whether or not an image should be saved to disk after it has been downloaded, and all of the information about the image is known.
Example:
URL url = new URL("http://news.yahoo.com/");
Vector<HTMLNode> page = HTMLPage.getPageTokens(url, false);

// Download a Web-Page, and extract all Image-Elements
List<String> images = InnerTagGet
    .all(page, "img", "src")
    .stream()
    .map((TagNode tn) -> tn.AV("src"))
    .filter((String src) -> src.length() > 0)
    .toList();

// Build a Request Object using the Images-as-String's List
Request req = Request.buildFromStrIter(images, url, true);

// Add a few more Scraper-Configurations
req.targetDirectory                     = "data/MyWebPages/Page01/";
req.useDefaultCounterForImageFileNames  = true;
req.skipOnDownloadException             = true;
req.verbosity                           = Verbosity.Verbose;

// Run the scraper, Send all Text-Output to 'System.out'
Results results = ImageScraper.download(req, System.out);
- See Also:
- Serialized Form
Source-Code: Torello/HTML/Tools/Images/Request.java
File Size: 27,884 Bytes, Line Count: 580 '\n' Characters Found
Field Summary
Serializable ID
    Modifier and Type            Field
    static long                  serialVersionUID

Primary / Core Download Fields
    Modifier and Type            Field
    URL                          originalPageURL
    int                          size
    Function<URL,URL>            urlPreProcessor
    Verbosity                    verbosity

Location-Decisions for Saving an Image File
    Modifier and Type            Field
    Consumer<ImageInfo>          imageReceiver
    String                       targetDirectory
    Function<ImageInfo,File>     targetDirectoryRetriever

File-Name given to an Image File
    Modifier and Type            Field
    String                       fileNamePrefix
    Function<ImageInfo,String>   getImageFileSaveName
    boolean                      useDefaultCounterForImageFileNames

Booleans for Deciding When to Continue on Exception / Failure or Throw
    Modifier and Type            Field
    boolean                      skipBase64EncodedImages
    boolean                      skipOnB64DecodeException
    boolean                      skipOnDownloadException
    boolean                      skipOnImageWritingFail
    boolean                      skipOnNullImageException
    boolean                      skipOnTimeOutException
    boolean                      skipOnUserLambdaException

Predicates for Deciding Which Image Files to Save, and Which to Skip
    Modifier and Type            Field
    Predicate<ImageInfo>         keeperPredicate
    Predicate<URL>               skipURL

Avoiding Hangs & Locks with a TimeOut
    Modifier and Type            Field
    static long                  MAX_WAIT_TIME
    static TimeUnit              MAX_WAIT_TIME_UNIT
    long                         maxDownloadWaitTime
    TimeUnit                     waitTimeUnits

Applying a User-Agent when Downloading Images
    Modifier and Type            Field
    boolean                      alwaysUseUserAgent
    static String                DEFAULT_USER_AGENT
    boolean                      retryWithUserAgent
    String                       userAgent
-
Method Summary
Static Request-Builders: URL-as-String Iterable
    Modifier and Type    Method
    static Request       buildFromStrIter(Iterable<String> source)
    static Request       buildFromStrIter(Iterable<String> source, URL originalPageURL, boolean skipDontThrowIfBadStr)

Static Request-Builders: <IMG SRC=...>-TagNode Iterable
    Modifier and Type    Method
    static Request       buildFromTagNodeIter(Iterable<TagNode> source, boolean skipDontThrowIfBadSRCAttr)
    static Request       buildFromTagNodeIter(Iterable<TagNode> source, URL originalPageURL, boolean skipDontThrowIfBadSRCAttr)

Static Request-Builders: URL Iterable
    Modifier and Type    Method
    static Request       buildFromURLIter(Iterable<URL> source)

Simple Convenience-Method Setter
    Modifier and Type    Method
    void                 skipOnAllExceptions()

Methods: interface java.lang.Cloneable
    Modifier and Type    Method
    Request              clone()

Methods: class java.lang.Object
    Modifier and Type    Method
    String               toString()
-
-
-
Field Detail
-
serialVersionUID
public static final long serialVersionUID
This fulfils the SerialVersion UID requirement for all classes that implement Java's interface java.io.Serializable. Using the Serializable implementation offered by Java is very easy, and can make saving program state when debugging a lot easier. It can also be used in place of more complicated systems like "hibernate" to store data.
- See Also:
- Constant Field Values
-
originalPageURL
public final java.net.URL originalPageURL
The URL from which this page was downloaded.
-
size
public final int size
The number of Image-URL's identified inside the 'source' Iterable.
-
verbosity
public Verbosity verbosity
Allows a user of this utility to specify the Level of Verbosity (or silence) that is applied to the output mechanism while the tool is running.
Note that the Java enum Verbosity provides four distinct levels, and that the class ImageScraper does indeed implement all four variants of textual output.
NullPointerException: This field may not be null, or a NullPointerException will be thrown!
-
urlPreProcessor
public java.util.function.Function<java.net.URL,java.net.URL> urlPreProcessor
When non-null, this allows a user to modify any image-URL immediately prior to the ImageScraper beginning the download process for that image. This is likely of limited use, but there are certainly situations where (for example) escaped-characters need to be un-escaped prior to starting the download.
In such cases, just write a lambda-target that accepts a URL, and processes it (in some way, of your choosing), and the downloader will use that updated URL-instance for making the HTTP-Connection to download the picture.
Setting to null: This field may be null, and when it is, it shall be ignored. Upon construction, this class initializes this field to null.
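A minimal sketch of a lambda-target with the Function<URL,URL> shape described above. The class name, the helper 'toUrl', the constant name and the example address are hypothetical; the un-escaping rule (turning "&amp;" back into "&") is only one assumed example of what such a pre-processor might do.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.function.Function;

// Stand-alone sketch; only JDK classes are used.
class UrlPreProcessorSketch
{
    // Helper (hypothetical): builds a URL from a String without a checked exception.
    static URL toUrl(String s)
    {
        try
            { return URI.create(s).toURL(); }
        catch (MalformedURLException e)
            { throw new IllegalArgumentException("could not build URL: " + s, e); }
    }

    // Replaces HTML-escaped ampersands that were copied verbatim from an attribute value.
    static final Function<URL, URL> UNESCAPE_AMPERSANDS =
        (URL url) -> toUrl(url.toString().replace("&amp;", "&"));

    public static void main(String[] args)
    {
        URL escaped = toUrl("https://example.com/img?id=7&amp;size=large");
        System.out.println(UNESCAPE_AMPERSANDS.apply(escaped));
    }
}
```

A function of this shape could then be assigned to the public field, for instance: req.urlPreProcessor = UrlPreProcessorSketch.UNESCAPE_AMPERSANDS;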
-
targetDirectoryRetriever
public java.util.function.Function<ImageInfo,java.io.File> targetDirectoryRetriever
Allows a user to specify where to save an Image-File after being downloaded.
Setting to null: This field may be null, and when it is, it shall be ignored. Upon construction, this class initializes this field to null.
-
imageReceiver
public java.util.function.Consumer<ImageInfo> imageReceiver
A functional-interface that allows a user to save an image-file to a location of his or her choosing. Provide an implementation of this interface if saving image files to a target-directory on the file-system is not acceptable, and the programmer wishes to do something else with the downloaded images.
Setting to null: This field may be null, and when it is, it shall be ignored. Upon construction, this class initializes this field to null.
-
targetDirectory
public java.lang.String targetDirectory
When this configuration-field is non-null, this String parameter is used to identify the file-system directory where downloaded image-files are to be saved.
Setting to null: This field may be null, and when it is, it shall be ignored. Upon construction, this class initializes this field to null.
-
fileNamePrefix
public java.lang.String fileNamePrefix
When this field is non-null, this String will be prepended to each image file-name that is saved or stored to the file-system.
Setting to null: This field may be null, and when it is, it shall be ignored. Upon construction, this class initializes this field to null.
-
useDefaultCounterForImageFileNames
public boolean useDefaultCounterForImageFileNames
When TRUE, images will be saved according to a counter; when this is FALSE, the software will attempt to save these images using their original filenames, picked from the URL. Saving using a counter is the default behaviour for this class.
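The two naming strategies mentioned above can be sketched as follows. This is an illustration only: the zero-padded five-digit format, the method names and the example URL are assumptions, and the library's own naming rules may differ.

```java
import java.net.URI;
import java.util.Locale;

// Stand-alone sketch of counter-based versus URL-derived file names.
class FileNameSketch
{
    // Counter-based name: a zero-padded counter followed by the extension (assumed format).
    static String counterName(int counter, String extension)
    { return String.format(Locale.ROOT, "%05d.%s", counter, extension); }

    // URL-derived name: everything after the last '/' of the URL's path (query excluded).
    static String nameFromUrl(String url)
    {
        String path = URI.create(url).getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args)
    {
        System.out.println(counterName(7, "jpg"));
        System.out.println(nameFromUrl("https://example.com/pics/cat.png?w=200"));
    }
}
```

A counter never produces name collisions, whereas names taken from the URL are easier to recognize but may repeat across different directories of the same site.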
-
getImageFileSaveName
public java.util.function.Function<ImageInfo,java.lang.String> getImageFileSaveName
When this field is non-null, each time an image is written to the file-system, this function will be queried for a file-name before writing the image-file.
Setting to null: This field may be null, and when it is, it shall be ignored. Upon construction, this class initializes this field to null.
-
skipOnDownloadException
public boolean skipOnDownloadException
Requests that the downloader-logic catch any & all exceptions that are thrown when downloading images from an Internet-URL. The failed download is simply reflected in the Results output arrays, and the download-process moves on to the next Iterable-Element.
Exceptions Skipped: This particular configuration-boolean allows a user to focus on exceptions that are thrown while Java's ImageIO class is downloading an image, and suddenly fails.
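The skip-or-throw behaviour described above follows a common pattern, sketched here with plain JDK types. The class, the method names and the fake download function are hypothetical; the sketch records skipped items in a simple list, whereas the real downloader records them in the Results arrays.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Stand-alone sketch of a loop that either suppresses or rethrows failures.
class SkipOnExceptionSketch
{
    // Applies 'task' to every item. A failure is recorded and skipped, or rethrown.
    static <T> List<String> runAll(List<T> items, Function<T, String> task, boolean skipOnException)
    {
        List<String> log = new ArrayList<>();

        for (T item : items)
            try
                { log.add("ok: " + task.apply(item)); }
            catch (RuntimeException e)
            {
                if (!skipOnException) throw e;  // the caller asked for failures to propagate
                log.add("skipped: " + item + " (" + e.getMessage() + ")");
            }

        return log;
    }

    // Pretend download (hypothetical): fails for the item named "bad".
    static String fakeDownload(String name)
    {
        if (name.equals("bad")) throw new IllegalStateException("boom");
        return name.toUpperCase();
    }

    public static void main(String[] args)
    {
        runAll(List.of("a", "bad", "c"), SkipOnExceptionSketch::fakeDownload, true)
            .forEach(System.out::println);
    }
}
```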
-
skipOnB64DecodeException
public boolean skipOnB64DecodeException
Requests that the downloader-logic catch any & all exceptions that are thrown when decoding Base-64 Encoded Images. The failed conversion is simply reflected in the Results output arrays, and the download-process moves on to the next Iterable-Element.
Exceptions Skipped: This particular configuration-boolean allows a user to focus on exceptions that are thrown when Java's Base-64 Image-Decoder throws an exception.
-
skipOnTimeOutException
public boolean skipOnTimeOutException
Requests that the downloader-logic catch any & all exceptions that are thrown while waiting for an image to finish downloading from an Internet-URL. The failed download is simply reflected in the Results output arrays, and the download-process moves on to the next Iterable-Element.
Exceptions Skipped: This particular configuration-boolean allows a user to focus on exceptions that are thrown when the Monitor-Thread has timed-out.
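One common way to bound the time spent on a single task in Java is a worker thread plus Future.get with a time limit. The sketch below illustrates that general technique under stated assumptions: the class and method names are hypothetical, and this is not claimed to be how the ImageScraper's Monitor-Thread is implemented.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Stand-alone sketch of a time-out that is either suppressed or reported.
class TimeOutSketch
{
    // Runs 'job' on a worker thread and waits at most 'maxWait' for its result.
    // On time-out: returns null when 'skipOnTimeOut' is true, otherwise throws.
    static <T> T callWithTimeOut(Callable<T> job, long maxWait, TimeUnit unit, boolean skipOnTimeOut)
    {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(job);

        try
            { return future.get(maxWait, unit); }

        catch (TimeoutException e)
        {
            future.cancel(true);  // interrupt the stalled worker
            if (skipOnTimeOut) return null;
            throw new IllegalStateException("download timed out", e);
        }

        catch (InterruptedException | ExecutionException e)
            { throw new IllegalStateException("download failed", e); }

        finally
            { executor.shutdownNow(); }
    }

    public static void main(String[] args)
    {
        // A fast job finishes in time; a slow job is abandoned after 50 milliseconds.
        String fast = callWithTimeOut(() -> "fast", 2, TimeUnit.SECONDS, true);
        String slow = callWithTimeOut(
            () -> { Thread.sleep(5_000); return "slow"; }, 50, TimeUnit.MILLISECONDS, true);

        System.out.println(fast);
        System.out.println(slow);
    }
}
```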
-
skipOnNullImageException
public boolean skipOnNullImageException
There are occasions when Java's ImageIO class returns a null image, rather than throwing an exception at all. In these cases, the ImageScraper class throws its own exception, unless this boolean has expressly requested to skip-and-move-on when the ImageIO returns null from downloading a URL.
Exceptions Skipped: This particular configuration-boolean allows a user to focus on exceptions that are thrown by the ImageScraper when a downloaded image is null.
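The null return mentioned above can be reproduced with the JDK alone: ImageIO.read returns null when no registered reader recognizes the input. In this sketch the class name, the method name and the exception type thrown are assumptions; only the ImageIO behaviour itself is standard.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import javax.imageio.ImageIO;

// Stand-alone sketch of handling a null result from ImageIO.read.
class NullImageSketch
{
    // Decodes 'bytes'; a null result is either passed through or turned into an exception.
    static BufferedImage decode(byte[] bytes, boolean skipOnNullImage)
    {
        ImageIO.setUseCache(false);  // keep the sketch in memory, no temporary files

        BufferedImage image;

        try
            { image = ImageIO.read(new ByteArrayInputStream(bytes)); }
        catch (IOException e)
            { throw new UncheckedIOException(e); }

        if ((image == null) && (! skipOnNullImage))
            throw new IllegalStateException("ImageIO.read returned null: no reader matched");

        return image;
    }

    public static void main(String[] args)
    {
        // An HTML error page served in place of a picture is a typical cause.
        byte[] notAnImage = "<html>not an image</html>".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(notAnImage, true));
    }
}
```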
-
skipOnImageWritingFail
public boolean skipOnImageWritingFail
If an attempt is made to write an Image to the File-System, and an exception is thrown, this boolean requests that rather than throwing the exception, the downloader make a note in the log that a failure occurred, and move on to the next image.
Exceptions Skipped: This particular configuration-boolean allows a user to focus on exceptions that are thrown when writing an already downloaded and converted image to the file-system.
-
skipOnUserLambdaException
public boolean skipOnUserLambdaException
This can be helpful if there are any "doubts" about the quality of the Functional-Interfaces that have been provided to this Request-instance.
Exceptions Skipped: This particular configuration-boolean allows a user to focus on exceptions that are thrown by any of the Lambda-Target / Functional-Interfaces that are provided by the user via this Request-instance.
-
skipURL
public java.util.function.Predicate<java.net.URL> skipURL
If this field is non-null, then before any URL is connected for a download, the download mechanism will ask this URL-Predicate for permission first. If this Predicate returns FALSE for a given URL, then that image will not be downloaded, but rather skipped.
Setting to null: This field may be null, and when it is, it shall be ignored. Upon construction, this class initializes this field to null.
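A minimal sketch of writing and evaluating a Predicate<URL>. It only shows the mechanics of such a predicate; how the download-loop interprets the returned value is governed by the description above, not by this example. The class name, the constant name, the helper 'toUrl' and the ".gif" rule are hypothetical.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.Locale;
import java.util.function.Predicate;

// Stand-alone sketch; only JDK classes are used.
class UrlPredicateSketch
{
    // True when the URL's path ends in ".gif" (case-insensitive, query-string ignored).
    static final Predicate<URL> IS_GIF =
        (URL url) -> url.getPath().toLowerCase(Locale.ROOT).endsWith(".gif");

    // Helper (hypothetical): builds a URL from a String without a checked exception.
    static URL toUrl(String s)
    {
        try
            { return URI.create(s).toURL(); }
        catch (MalformedURLException e)
            { throw new IllegalArgumentException("could not build URL: " + s, e); }
    }

    public static void main(String[] args)
    {
        System.out.println(IS_GIF.test(toUrl("https://example.com/a/spacer.GIF?x=1")));
        System.out.println(IS_GIF.test(toUrl("https://example.com/a/photo.png")));
    }
}
```

Predicates compose with negate(), and(...) and or(...), so the opposite decision is available as IS_GIF.negate() without writing a second lambda.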
-
skipBase64EncodedImages
public boolean skipBase64EncodedImages
This scraper has the ability to decode and save Base-64 Images, and they may be downloaded or skipped, based on this boolean. If an Iterable<TagNode> is passed to the constructor, and one of those TagNode's contains an Image Element (<IMG SRC="data:image/jpeg;base64,...data">), this class has the ability to interpret and save the image to a regular image file. By default, Base-64 images are skipped, but they can also be downloaded.
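A sketch of how the payload of a Base-64 'data:' SRC-Attribute can be decoded with java.util.Base64. The class and method names are hypothetical, the parsing is deliberately simple (it assumes the ';base64,' marker is present), and it is not claimed to mirror the scraper's own decoder. The sample payload encodes only the eight-byte PNG signature rather than a complete image.

```java
import java.util.Base64;

// Stand-alone sketch of splitting and decoding a Base-64 data-URL.
class DataUrlSketch
{
    private static final String MARKER = ";base64,";

    // Returns the media type, e.g. "image/png" for "data:image/png;base64,....".
    static String mimeType(String src)
    {
        checkIsDataUrl(src);
        return src.substring("data:".length(), src.indexOf(';'));
    }

    // Returns the decoded bytes that follow the ";base64," marker.
    static byte[] decode(String src)
    {
        checkIsDataUrl(src);
        return Base64.getDecoder().decode(src.substring(src.indexOf(MARKER) + MARKER.length()));
    }

    private static void checkIsDataUrl(String src)
    {
        if ((! src.startsWith("data:")) || (src.indexOf(MARKER) < 0))
            throw new IllegalArgumentException("not a Base-64 data-URL");
    }

    public static void main(String[] args)
    {
        String src = "data:image/png;base64,iVBORw0KGgo=";
        System.out.println(mimeType(src) + ", " + decode(src).length + " bytes");
    }
}
```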
-
keeperPredicate
public java.util.function.Predicate<ImageInfo> keeperPredicate
Allows for a user-provided decision-predicate about whether to retain & save, or discard, an image after downloading. All information available in data-flow class ImageInfo is provided to this predicate, and ought to be enough to decide whether to save or reject any downloaded image.
-
MAX_WAIT_TIME
public static final long MAX_WAIT_TIME
This is the default maximum wait time for an image to download. This value may be reset or modified by instantiating an ImageScraper.AdditionalParameters class, and passing the desired values to the constructor. This value is measured in units of public static final java.util.concurrent.TimeUnit MAX_WAIT_TIME_UNIT.
-
MAX_WAIT_TIME_UNIT
public static final java.util.concurrent.TimeUnit MAX_WAIT_TIME_UNIT
This is the default measuring unit for the static final long MAX_WAIT_TIME member. This value may be reset or modified by instantiating an ImageScraper.AdditionalParameters class, and passing the desired values to the constructor.
- See Also:
MAX_WAIT_TIME, waitTimeUnits
-
maxDownloadWaitTime
public long maxDownloadWaitTime
If you do not want the downloader to hang on an image, which is sometimes an issue depending upon the site from which you are making a request, set this parameter, and the downloader will not wait past that amount of time to download an image. The default value for this parameter is 10 seconds. If you do not wish to set the max-wait-time (the download time-out counter), then leave the parameter "waitTimeUnits" set to null, and this parameter will be ignored.
-
waitTimeUnits
public java.util.concurrent.TimeUnit waitTimeUnits
This is the "unit of measurement" for the field long maxDownloadWaitTime.
NOTE: This parameter may be null, and if it is, both this parameter and the parameter long maxDownloadWaitTime will be ignored, and the default maximum-wait-time (download time-out settings) will be used instead.
Read: the documentation for the java.util.concurrent.* package, and for class java.util.concurrent.TimeUnit, for more information.
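The fall-back rule in the note above can be sketched with java.util.concurrent.TimeUnit. The default of 10 seconds is taken from the description of maxDownloadWaitTime; the constant names in this sketch, and the choice to express the result in milliseconds, are assumptions made for illustration.

```java
import java.util.concurrent.TimeUnit;

// Stand-alone sketch of resolving the effective time-out.
class WaitTimeSketch
{
    // Assumed defaults, based on the "10 seconds" stated in the field description.
    static final long     DEFAULT_WAIT = 10;
    static final TimeUnit DEFAULT_UNIT = TimeUnit.SECONDS;

    // When the unit is null, both user values are ignored and the defaults apply.
    static long effectiveWaitMillis(long maxDownloadWaitTime, TimeUnit waitTimeUnits)
    {
        return (waitTimeUnits == null)
            ? DEFAULT_UNIT.toMillis(DEFAULT_WAIT)
            : waitTimeUnits.toMillis(maxDownloadWaitTime);
    }

    public static void main(String[] args)
    {
        System.out.println(effectiveWaitMillis(3, TimeUnit.MINUTES));
        System.out.println(effectiveWaitMillis(99, null));
    }
}
```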
-
DEFAULT_USER_AGENT
public static final java.lang.String DEFAULT_USER_AGENT
There are web-sites that expect a User-Agent to be defined before allowing an image download to progress. There are even web-sites and servers that simply will not connect to a scraper unless a User-Agent is defined.
This is the default User-Agent that is defined inside of class Scrape.
-
userAgent
public java.lang.String userAgent
This is the User-Agent. Much literature is available on the Internet which can explain how this field operates.
-
alwaysUseUserAgent
public boolean alwaysUseUserAgent
Flag which informs the scraper whether or not the User-Agent is mandatory.
-
retryWithUserAgent
public boolean retryWithUserAgent
Flag which requests automatically retrying a download after failure, this time sending a "User-Agent".
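At the JDK level, a User-Agent is attached to a connection as an HTTP request header. The sketch below shows that single step with java.net.URLConnection; the class name, the method name, the example address and the agent string are hypothetical, and no network traffic occurs because the connection is never opened.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

// Stand-alone sketch of attaching a User-Agent header to a connection.
class UserAgentSketch
{
    // Prepares (but does not open) a connection, optionally with a User-Agent header.
    static URLConnection prepare(String url, String userAgent)
    {
        try
        {
            URLConnection con = URI.create(url).toURL().openConnection();
            if (userAgent != null) con.setRequestProperty("User-Agent", userAgent);
            return con;
        }
        catch (IOException e)
            { throw new UncheckedIOException(e); }
    }

    public static void main(String[] args)
    {
        URLConnection con = prepare("http://example.com/picture.png", "ExampleAgent/1.0");
        System.out.println(con.getRequestProperty("User-Agent"));
    }
}
```

A retry in the spirit of 'retryWithUserAgent' would call such a method once with a null agent and, only if the download fails, a second time with the agent set.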
-
-
Method Detail
-
buildFromStrIter
public static Request buildFromStrIter (java.lang.Iterable<java.lang.String> source)
Builds an instance of this class from a list of URL's as String's
- Parameters:
source - Accepts any Java Iterable containing String's. Note that if any of these String's are malformed URL's, then this method will throw an IllegalArgumentException, with a Java MalformedURLException as its cause.
Furthermore, if any of these String's contain partially resolved URL's, this will also force this method to throw an IllegalArgumentException.
- Returns:
- A 'Request' instance. This may be further configured by assigning values to any / all fields (which will still have their initialized / default-values)
- Throws:
java.lang.NullPointerException - If any of the String's in the Iterable are null
java.lang.IllegalArgumentException - If any of the URL's are String's which begin with neither 'http://' nor 'https://'. Since this method doesn't accept the parameter 'originalPageURL', each and every URL in the 'source' iterable must be a full & complete URL.
This exception will also throw if there are any URL's in the String-List that cause a MalformedURLException to throw when constructing an instance of java.net.URL from the String. In these cases, the original MalformedURLException will be assigned to the 'cause', and may be retrieved using the exception's getCause() method.
- Code:
- Exact Method Body:
return FromStringIterator.build(source);
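The validation rules listed under 'Throws' can be sketched with the JDK alone. This is an illustration of the documented contract, not the body of FromStringIterator: the class and method names are hypothetical, and because the sketch parses through java.net.URI, the attached cause may be a URISyntaxException as well as a MalformedURLException.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.Objects;

// Stand-alone sketch of converting a complete URL-as-String into a URL.
class StrToUrlSketch
{
    static URL parse(String s)
    {
        Objects.requireNonNull(s, "URL-String may not be null");

        // Without a page-URL to resolve against, only complete URL's are acceptable.
        if ((! s.startsWith("http://")) && (! s.startsWith("https://")))
            throw new IllegalArgumentException("not a complete URL: " + s);

        try
            { return new URI(s).toURL(); }
        catch (URISyntaxException | MalformedURLException e)
            { throw new IllegalArgumentException("malformed URL: " + s, e); }
    }

    public static void main(String[] args)
    {
        System.out.println(parse("https://example.com/pics/a.png"));

        try
            { parse("pics/a.png"); }
        catch (IllegalArgumentException e)
            { System.out.println("rejected: " + e.getMessage()); }
    }
}
```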
-
buildFromStrIter
public static Request buildFromStrIter (java.lang.Iterable<java.lang.String> source, java.net.URL originalPageURL, boolean skipDontThrowIfBadStr)
Builds an instance of this class from a list of URL's as String's
- Parameters:
source - Accepts any Java Iterable containing String's. Note that if any of these String's are malformed URL's, then this method will throw an IllegalArgumentException, with a Java MalformedURLException as its cause.
Furthermore, if any of these String's contain partially resolved URL's, this will also force this method to throw an IllegalArgumentException.
originalPageURL - This parameter may not be null, or a NullPointerException will throw. This parameter is used to help resolve any partially resolved URL's.
skipDontThrowIfBadStr - If an exception is thrown when attempting to resolve a partial-URL, and this parameter is TRUE, then that exception is suppressed and logged, and the builder-loop continues to the next URL-as-a-String.
When this parameter is passed FALSE, unresolvable URL's will cause an IllegalArgumentException to be thrown.
Note that the presence of a null in the Iterable 'source' parameter will always force this method to throw NullPointerException.
- Returns:
- A 'Request' instance. This may be further configured by assigning values to any / all fields (which will still have their initialized / default-values)
- Throws:
java.lang.NullPointerException - If any of the String's in the Iterable are null
java.lang.IllegalArgumentException - If there are any URL's in the String-List that cause a MalformedURLException to throw when constructing an instance of java.net.URL from the String. In these cases, the generated MalformedURLException will be assigned to the exception's 'cause', and may therefore be retrieved using this exception's getCause() method.
- Code:
- Exact Method Body:
return FromStringIterator.build(source, originalPageURL, skipDontThrowIfBadStr);
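Resolving partial URL's against the address of the page they came from can be sketched with java.net.URI.resolve. The class name, the method name and the sample addresses are hypothetical. The sketch silently drops unresolvable entries when the flag is true, whereas the description above says the real builder also logs them.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Stand-alone sketch of resolving relative image addresses against a page address.
class ResolveSketch
{
    static List<String> resolveAll(String pageUrl, List<String> sources, boolean skipDontThrow)
    {
        URI base = URI.create(pageUrl);
        List<String> resolved = new ArrayList<>();

        for (String src : sources)
            try
                { resolved.add(base.resolve(src).toString()); }
            catch (IllegalArgumentException e)  // thrown by URI.resolve on invalid syntax
                { if (! skipDontThrow) throw e; }

        return resolved;
    }

    public static void main(String[] args)
    {
        List<String> sources = List.of(
            "img/a.png",                        // relative to the page's directory
            "/logo.png",                        // relative to the site root
            "bad src",                          // invalid: contains a space
            "https://cdn.example.com/b.jpg"     // already complete
        );

        resolveAll("https://example.com/news/index.html", sources, true)
            .forEach(System.out::println);
    }
}
```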
-
buildFromTagNodeIter
public static Request buildFromTagNodeIter (java.lang.Iterable<TagNode> source, java.net.URL originalPageURL, boolean skipDontThrowIfBadSRCAttr)
Builds an instance of this class using the SRC-Attribute from a list of TagNode's.
- Parameters:
source - Accepts any Java Iterable containing TagNode instances. If any of these TagNode's do not have an HTML-SRC Attribute, then this builder method will immediately throw a SRCException.
originalPageURL - This parameter may not be null, or a NullPointerException will throw. This parameter is used to help resolve any partially resolved URL's.
skipDontThrowIfBadSRCAttr - When this parameter is passed TRUE, if any of the TagNode elements inside 'source' either do not have a SRC-Attribute, or have such an attribute containing an invalid URL that cannot be instantiated, then this element of the iterable will simply be skipped, gracefully.
When this parameter is passed FALSE, if either of the above two conditions / situations arises, then an exception will be thrown (this builder method will not run to completion).
Note that the presence of a null inside the Iterable 'source' parameter will always force this method to throw NullPointerException.
- Returns:
- A 'Request' instance. This may be further configured by assigning values to any / all fields (which will still have their initialized / default-values)
- Throws:
java.lang.NullPointerException - If any of the TagNode's in the Iterable are null
SRCException - If any of the TagNode's in the list do not have a 'SRC' Attribute, and 'skipDontThrowIfBadSRCAttr' is FALSE.
This exception will also throw if there are any URL's in the TagNode-List that cause a MalformedURLException to throw when constructing an instance of java.net.URL (from the TagNode's SRC-Attribute). In these cases, the generated MalformedURLException will be assigned to the exception's 'cause', and may therefore be retrieved using the exception's getCause() method.
If 'skipDontThrowIfBadSRCAttr' is TRUE, then this exception will not throw, and a null will be placed in the query-list.
- Code:
- Exact Method Body:
return FromTagNodeIterator.build(source, originalPageURL, skipDontThrowIfBadSRCAttr);
-
buildFromTagNodeIter
public static Request buildFromTagNodeIter (java.lang.Iterable<TagNode> source, boolean skipDontThrowIfBadSRCAttr)
Builds an instance of this class using the SRC-Attribute from a list of TagNode's.
- Parameters:
source - Accepts any Java Iterable containing TagNode instances. If any of these TagNode's do not have an HTML-SRC Attribute, then this builder method will immediately throw a SRCException.
skipDontThrowIfBadSRCAttr - When this parameter is passed TRUE, if any of the TagNode elements inside 'source' either do not have a SRC-Attribute, or have such an attribute containing an invalid URL that cannot be instantiated, then this element of the iterable will simply be skipped, gracefully.
When this parameter is passed FALSE, if either of the above two conditions / situations arises, then an exception will be thrown (this builder method will not run to completion).
Note that the presence of a null inside the Iterable 'source' parameter will always force this method to throw NullPointerException.
- Returns:
- A 'Request' instance. This may be further configured by assigning values to any / all fields (which will still have their initialized / default-values)
- Throws:
java.lang.NullPointerException - If any of the TagNode's in the Iterable are null
SRCException - If any of the TagNode's in the list do not have a 'SRC'-Attribute, and 'skipDontThrowIfBadSRCAttr' is FALSE.
This exception will also throw if any of the URL's assigned to a 'SRC'-Attribute are partial-URL's which do not begin with 'http://' (or 'https://'), and 'skipDontThrowIfBadSRCAttr' is FALSE.
Finally, if any of the URL's inside a TagNode's 'SRC'-Attribute cause a MalformedURLException, that exception will be assigned to the cause of a SRCException, and thrown (unless 'skipDontThrowIfBadSRCAttr' is TRUE).
- Code:
- Exact Method Body:
return FromTagNodeIterator.build(source, skipDontThrowIfBadSRCAttr);
-
buildFromURLIter
public static Request buildFromURLIter (java.lang.Iterable<java.net.URL> source)
Builds an instance of this class using a list of already prepared URL's.
- Parameters:
source - Accepts any Java Iterable containing URL's. Note that if any of these URL's are not fully resolved, then when downloading begins, any un-resolved ones will cause an exception to throw.
- Returns:
- A 'Request' instance. This may be further configured by assigning values to any / all fields (which will still have their initialized / default-values)
- Throws:
java.lang.NullPointerException - If any of the URL's in the Iterable are null
- Code:
- Exact Method Body:
return FromURLIterator.build(source);
-
skipOnAllExceptions
public void skipOnAllExceptions()
This allows a user to quickly / easily set all 'skipOn' flags in one method call
- Code:
- Exact Method Body:
// exceptions thrown by Java's ImageIO class when downloading an image
skipOnDownloadException =

// if Java's Base-64 Image-Decoder throws an exception.
skipOnB64DecodeException =

// exception that's thrown when the Monitor-Thread has timed-out.
skipOnTimeOutException =

// exception that's thrown when a downloaded image is null.
skipOnNullImageException =

// exceptions thrown when writing an already downloaded image to the file-system.
skipOnImageWritingFail =

// exceptions thrown by any of the User-Provided Lambda-Target / Functional-Interfaces
skipOnUserLambdaException = true;
-
toString
public java.lang.String toString()
Converts 'this' instance into a simple Java-String
- Overrides:
toString in class java.lang.Object
- Returns:
- A String where each field has had a 'best efforts' String-Conversion
- Code:
- Exact Method Body:
return RequestToString.toString(this);
-
clone
public Request clone()
Builds a clone of 'this' instance
- Overrides:
clone in class java.lang.Object
- Returns:
- The copied instance. Note that this is a shallow clone, rather than a deep clone. The references within the returned instance are the exact same references as are in 'this' instance.
- Code:
- Exact Method Body:
final Request cloned = new Request(this);
RequestClone.copy(this, cloned);
return cloned;
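What "shallow clone" means in practice can be sketched with a small stand-in class. The class and its two fields are hypothetical, and the sketch relies on Object.clone() rather than the copy-constructor shown in the method body above; the observable effect on references is the point being illustrated.

```java
// Stand-alone sketch: primitives are copied, object references are shared.
class ShallowCloneSketch implements Cloneable
{
    int counter = 0;                                    // copied by value
    StringBuilder prefix = new StringBuilder("img-");   // copied by reference

    @Override
    public ShallowCloneSketch clone()
    {
        try
            { return (ShallowCloneSketch) super.clone(); }
        catch (CloneNotSupportedException e)
            { throw new AssertionError("cannot happen, the class is Cloneable", e); }
    }

    public static void main(String[] args)
    {
        ShallowCloneSketch original = new ShallowCloneSketch();
        ShallowCloneSketch copy     = original.clone();

        copy.counter = 5;           // affects only the copy
        copy.prefix.append("x");    // visible through both instances

        System.out.println(original.counter + " " + original.prefix);
    }
}
```

For a Request this implies that lambda-targets and other object-valued fields are shared between the original and the clone, while re-assigning a field on one instance leaves the other untouched.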
-
-