Package Torello.HTML.Tools.Images
Class ImageScrape
- java.lang.Object
-
- Torello.HTML.Tools.Images.ImageScrape
-
public class ImageScrape extends java.lang.Object
A simple class for scraping and downloading images using a URL, or a list of URL's.
This class can essentially take a java.util.Vector<String> filled with HTTP-URL's that point to photo-images on internet web-pages, and download them to a local directory. It can keep the original names, or generate simpler, easier to use 'pre-numbered' names. When I am dealing with photo-news images, the file-names for most of the pictures I download are generated by a random number generator, so as images are downloaded, they are simply renamed to '001.jpg', '002.jpg', '003.gif', etc.
This class also distinguishes between '.jpg', '.png', '.gif', '.bmp' and '.jpeg', and tries to guess the file type based on the file-name extension. Some sites occasionally have inaccurate file-name extensions; when a save fails, this class will attempt to save the image using the other available file-codecs, until either the image-file has been saved successfully or all image-formats have failed.
LEGACY NOTE:
The class ImageScraper can handle quite a few more variable situations; for instance, how the downloaded images are numbered, where they are stored, and how exceptions are handled (preventing batch jobs from failing due to a single failed download). This class was an earlier version of the more robust class ImageScraper, and due to its ease of use, it shall remain available here.
Hi-Lited Source-Code: Torello/HTML/Tools/Images/ImageScrape.java
File Size: 16,941 Bytes; Line Count: 371 '\n' Characters Found
-
-
Field Summary
Modifier and Type        Field
static String[]          imageExts
-
Method Summary
Download an Image
Modifier and Type        Method
static String            downloadImageGuessType(String urlStr, String outputFileStr)
static String            downloadImageGuessType(String urlStr, String outputFileStr, String outputDirectory)
static void              getImage(String urlStr, String outputFileStr, String extensionStr)
Download a List of Images
Modifier and Type        Method
static Vector<String>    downloadImagesGuessTypes(Iterable<String> urls)
static Vector<String>    downloadImagesGuessTypes(Iterable<String> urls, String outputDirectory)
static Vector<String>    downloadImagesGuessTypes(String rootURL, Iterable<String> urls)
static Vector<String>    downloadImagesGuessTypes(String rootURL, Iterable<String> urls, String outputDirectory)
Download a List of Images, Listed in a File
Modifier and Type        Method
static Vector<String>    downloadImages(File f)
Extract Image-Type from File Extension
Modifier and Type        Method
static String            getImageTypeFromURL(String urlStr)
-
-
-
Field Detail
-
imageExts
public static final java.lang.String[] imageExts
String-array holding the list of image file-formats.
-
-
Method Detail
-
getImageTypeFromURL
public static java.lang.String getImageTypeFromURL (java.lang.String urlStr)
This will extract the file-extension from an image URL.
Not all images on the internet have URL's that end with the actual image-file-type. In that case, or in the case that 'urlStr' is a pointer to a non-image-file, null will be returned.
- Parameters:
urlStr - The URL of the image.
- Returns:
If the URL has a file-extension that is listed in the 'imageExts' array, that file-extension will be returned; otherwise null will be returned.
- Code:
- Exact Method Body:
    if (urlStr == null) return null;

    String ext = StringParse.fromExtension(urlStr, false);

    if (ext == null) return null;

    ext = ext.toLowerCase();

    for (int i=0; i < imageExts.length; i++)
        if (imageExts[i].equals(ext)) return imageExts[i];

    return null;
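The lookup above depends on StringParse.fromExtension, whose exact behavior is not shown on this page. As a rough, JDK-only sketch of the same idea, the class below takes the text after the last '.' and compares it to a list of known extensions. The class name, the method name, and the contents of IMAGE_EXTS are assumptions made for illustration; they are not the library's actual code.

```java
import java.util.Locale;

final class ImageTypeSketch {
    // Assumed contents: the real 'imageExts' array may hold other values.
    static final String[] IMAGE_EXTS = { "jpg", "png", "gif", "bmp", "jpeg" };

    /** Returns the known image extension at the end of 'urlStr', or null if there is none. */
    static String imageTypeFromUrl(String urlStr) {
        if (urlStr == null) return null;

        // Everything after the last '.' is treated as the candidate extension.
        int dot = urlStr.lastIndexOf('.');
        if (dot == -1 || dot == urlStr.length() - 1) return null;

        String ext = urlStr.substring(dot + 1).toLowerCase(Locale.ROOT);

        for (String known : IMAGE_EXTS)
            if (known.equals(ext)) return known;

        // Not an image extension (for instance ".html", or a URL with a query-string).
        return null;
    }
}
```

Under these assumptions, a URL ending in "photo.JPG" yields "jpg", while a URL such as "https://example.com/image?id=123" yields null because the text after its last '.' is not a known extension.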
-
downloadImageGuessType
public static java.lang.String downloadImageGuessType (java.lang.String urlStr, java.lang.String outputFileStr) throws java.io.IOException
- Code:
- Exact Method Body:
    // We need to check whether the file-name that was passed is just a filename; or if it
    // has a directory component in its name.
    int sep = outputFileStr.lastIndexOf(File.separator) + 1;

    if (sep == 0)
        return downloadImageGuessType(urlStr, outputFileStr, "");
    else if (sep == outputFileStr.length())
        return downloadImageGuessType(urlStr, "IMAGE", outputFileStr);
    else
        return downloadImageGuessType
            (urlStr, outputFileStr.substring(sep), outputFileStr.substring(0, sep));
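This two-argument overload carries no description of its own. Judging only from the method body above, it splits 'outputFileStr' into a directory part and a file-name part before delegating to the three-argument version. The sketch below isolates that splitting step so its three cases can be seen side by side; the class and method names are hypothetical.

```java
import java.io.File;

final class OutputPathSketch {
    /** Splits a path into { directory, fileName }, mirroring the three cases in the body above. */
    static String[] split(String outputFileStr) {
        // Index just past the last separator; 0 when there is no separator at all.
        int sep = outputFileStr.lastIndexOf(File.separator) + 1;

        // Case 1: a plain file-name, so the directory is the current one ("").
        if (sep == 0) return new String[] { "", outputFileStr };

        // Case 2: the string ends with a separator, so it names only a directory.
        // The method body above falls back to the file-name "IMAGE" here.
        if (sep == outputFileStr.length()) return new String[] { outputFileStr, "IMAGE" };

        // Case 3: both parts are present.
        return new String[] { outputFileStr.substring(0, sep), outputFileStr.substring(sep) };
    }
}
```

On a system whose separator is '/', split("pics/img") would give { "pics/", "img" } and split("pics/") would give { "pics/", "IMAGE" }.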
-
downloadImageGuessType
public static java.lang.String downloadImageGuessType (java.lang.String urlStr, java.lang.String outputFileStr, java.lang.String outputDirectory) throws java.io.IOException
This will download an image, and try to guess if it is one of the following types: .jpg, .png, .bmp, .gif or .jpeg. If 'urlStr' has a valid image-type extension as a filename, then that format will be used to save to a file. If that fails, an exception of type javax.imageio.IIOException is thrown.
Example:
    // Retrieve all images found on a random Yahoo! News web-page article
    URL url = new URL("https://news.yahoo.com/former-fox-news-employees [actual URL hidden].html");

    // Parse & Scrape the Web-Page, store it in a local html-vector
    Vector<HTMLNode> page = HTMLPage.getPageTokens(url, false);

    // Skip ahead to the "article body." The body is surrounded by an <ARTICLE>...</ARTICLE>
    // HTML Element. Retrieve (using 'Inclusive') - everything between the HTML "ARTICLE" Tags.
    page = TagNodeGetInclusive.first(page, "article");

    // Get the SECOND picture (HTML <IMG SRC=...>) element found on the page.
    // For the news-article used in this example, the first image was an icon thumbnail.
    // The second image contained the "Main Article Photo"
    TagNode firstPic = TagNodeGet.nth(page, 2, TC.OpeningTags, "img");
    String urlStr = Links.resolveSRC(firstPic, url).toString();

    // Run this method. A file named 'img.jpg' is saved.
    System.out.println("Image URL to Download:" + urlStr);
    ImageScrape.downloadImageGuessType(urlStr, "img");
- Parameters:
urlStr - The URL of the image. Yahoo! Images, for instance, have really long URL's and don't have any extensions at the end. If 'urlStr' does contain an image extension in the String, then this method will attempt to save the image using the appropriate file-extension, and throw an IIOException if it fails.
outputFileStr - This is the target or destination name for the output image file.
NOTE: This file-name is not intended to have an extension. The extension will be generated by the code in this method, and it will match whatever image-file-encoding was successfully used to save the file. If the URL ends in '.png', for instance, but the image did not save until '.bmp' was used (mis-labeled), this output file will be saved as 'outputFileStr' + '.bmp'.
URL vs. File Names: The parameter 'outputFileStr' may NOT be null. It is important to realize, here, that file-names and URL's do not obey the same naming conventions. Because image-URL's on the internet often contain a plethora of file-system 'irreverent' characters, this method simply cannot pick out the file-name of an image from its URL.
It may seem counter-intuitive to expect a "filename" parameter to be provided as input here, given that an image-URL is also required (since in most cases the file-name of the image being downloaded is included in the image's URL). However, because many modern content-providers on the internet use many layers of naming conventions for their image-URL's, the user must provide the file-name of the image (as a String) to avoid crashing this method in cases where the image file-name is "too difficult" to discern from its URL.
outputDirectory - This is just "prepended" to the file-save name. This String is not included in the returned filename. Specifically, the returned file name only includes the file-name and the file-name extension. It does not include the whole "canonical" or "absolute" directory-path name for this image.
- Returns:
The name of the file, including the extension type which did not throw a javax.imageio.IIOException. This exception is thrown whenever an image of, for instance, '.png' format tries to save as a '.jpg', or any other incorrect image-format.
NOTE: null will be returned if the image failed to save at all.
ALSO: If the passed 'urlStr' does not save properly, javax.imageio.IIOException will also be thrown.
It is important to return the filename, since the extension identifies in what format the image was saved: .jpg, .gif, .png, etc.
- Throws:
WritableDirectoryException - The provided output directory must exist and be writable, or else this exception is thrown. Java will attempt to write a small, temporary file to the directory-name provided. It will be deleted immediately afterwards.
java.io.IOException
- See Also:
imageExts
- Code:
- Exact Method Body:
    // If the "file name" has directory components... it is just "better" to flag this as
    // an exception
    if (outputFileStr.indexOf(File.separator) != -1) throw new IllegalArgumentException(
        "This method expects parameter 'outputFileStr' to be a simple file-name, without " +
        "any directory-names attached. If directory names need to be attached to ensure " +
        "that the file is ultimately saved to the proper location in the file-system, " +
        "pass the directory to the 'outputDirectory' parameter to this method.\n" +
        "You have passed: " + outputFileStr + "\nwhich contains the file-name separator " +
        "character."
    );

    if (outputDirectory == null) outputDirectory = "";

    // Make sure the directory exists on the file-system, and that it is writable.
    WritableDirectoryException.check(outputDirectory);

    // Unless writing the "current directory" - make sure the directory name ends with the
    // Operating System file-separator character.
    if ((outputDirectory.length() > 0) && (! outputDirectory.endsWith(File.separator)))
        outputDirectory = outputDirectory + File.separator;

    BufferedImage image = ImageIO.read(new URL(urlStr));
    String ext = getImageTypeFromURL(urlStr);
    File f = null;

    if (ext != null)
        try
        {
            String fName = outputFileStr + '.' + ext;
            f = new File(outputDirectory + fName);
            ImageIO.write(image, ext, f);
            return fName;
        }
        // NOTE: If saving the file using the named image-extension fails, try the other.
        catch (javax.imageio.IIOException e) { f.delete(); }

    for (int i=0; i < imageExts.length; i++)
        try
        {
            f = new File(outputFileStr + '.' + imageExts[i]);
            ImageIO.write(image, imageExts[i], f);
            return outputFileStr + '.' + imageExts[i];
        }
        catch (javax.imageio.IIOException e) { f.delete(); continue; }

    System.out.println
        ("NOTE: Image " + urlStr + "\nAttempted to save to:" + outputFileStr + "\nFAILED.");

    return null;
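The "try one format, then fall back to the others" idea in the body above can be sketched with nothing but the JDK. The version below is an illustration under stated assumptions, not the library's implementation: the class name, the method names and the FORMATS list are hypothetical. It also checks the boolean returned by ImageIO.write, which is false when no suitable writer exists, and it writes every attempt into the same directory, whereas the fallback loop above builds its File without the 'outputDirectory' prefix.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.IIOException;
import javax.imageio.ImageIO;

final class FallbackSaveSketch {
    // Assumed list of format names; the real 'imageExts' array may differ.
    static final String[] FORMATS = { "jpg", "png", "gif", "bmp", "jpeg" };

    /** Saves 'image' as baseName + '.' + format and returns that name, or null if all formats fail. */
    static String save(BufferedImage image, File dir, String baseName, String preferred)
            throws IOException {
        // First try the format suggested by the URL's extension, if there was one.
        if (preferred != null
                && tryWrite(image, preferred, new File(dir, baseName + '.' + preferred)))
            return baseName + '.' + preferred;

        // Otherwise walk through every known format until one succeeds.
        for (String format : FORMATS)
            if (tryWrite(image, format, new File(dir, baseName + '.' + format)))
                return baseName + '.' + format;

        return null;
    }

    private static boolean tryWrite(BufferedImage image, String format, File target)
            throws IOException {
        try {
            // ImageIO.write returns false when no writer can handle this image/format pair.
            if (ImageIO.write(image, format, target)) return true;
        } catch (IIOException e) {
            // The encoder rejected the image; treat it like any other failed attempt.
        }
        target.delete(); // Remove any partial or empty file left behind.
        return false;
    }
}
```

A caller would pass the extension guessed from the URL as 'preferred' (or null when there is none) and keep the returned name, since its extension records the format that was actually used.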
-
downloadImagesGuessTypes
public static java.util.Vector<java.lang.String> downloadImagesGuessTypes (java.lang.String rootURL, java.lang.Iterable<java.lang.String> urls) throws java.io.IOException
- Code:
- Exact Method Body:
return downloadImagesGuessTypes(rootURL, urls, "");
-
downloadImagesGuessTypes
public static java.util.Vector<java.lang.String> downloadImagesGuessTypes (java.lang.Iterable<java.lang.String> urls) throws java.io.IOException
- Code:
- Exact Method Body:
return downloadImagesGuessTypes("", urls, "");
-
downloadImagesGuessTypes
public static java.util.Vector<java.lang.String> downloadImagesGuessTypes (java.lang.Iterable<java.lang.String> urls, java.lang.String outputDirectory) throws java.io.IOException
- Code:
- Exact Method Body:
return downloadImagesGuessTypes("", urls, outputDirectory);
-
downloadImagesGuessTypes
public static java.util.Vector<java.lang.String> downloadImagesGuessTypes (java.lang.String rootURL, java.lang.Iterable<java.lang.String> urls, java.lang.String outputDirectory) throws java.io.IOException
This will download an entire Vector<String> of URL's, and save the output file-names which were used to save these images. It will use the StringParse.zeroPad(int) method to generate filenames, starting with 001.jpg (or whatever extension was correct). It will use the guessed file-name extension that is appropriate for each image.
NOTE: As the images are downloaded, each file-name is printed via System.out.print()
Example:
    // Retrieve all images found on the Wikipedia (Encyclopedia) Page for Galileo
    URL url = new URL("https://en.wikipedia.org/wiki/Galileo_Galilei");

    // Parse & Scrape the Web-Page, store it in a local html-vector
    Vector<HTMLNode> page = HTMLPage.getPageTokens(url, false);

    // Get the "Vector Index Array" for every HTML <IMG> element found on the page.
    int[] imgPosArr = TagNodeFind.all(page, TC.OpeningTags, "img");

    // Since there are many "relative" or "partial" URL's, make sure to resolve them
    // against the main Wikipedia page-url. Also, note, that Links.resolveSRCs returns a
    // Vector<URL>, but that ImageScrape.downloadImagesGuessTypes requires String's,
    // so make sure to convert the output url's to strings.
    Vector<String> urls = new Vector<String>(imgPosArr.length);
    Links.resolveSRCs(page, imgPosArr, url).forEach((URL u) -> urls.add(u.toString()));

    // Run this method. A series of '.png' and '.jpg' files will be saved to the current
    // working directory.
    ImageScrape.downloadImagesGuessTypes(urls);
- Parameters:
urls - An Iterable (for instance a Vector) of String's that are to contain image pointers.
rootURL - If these are "sub-urls" with a root URL, this root URL is pre-pended to each of the String's in 'urls'. This parameter may be the empty string (""), in which case it will be ignored.
outputDirectory - The files that are downloaded are saved to this directory.
- Returns:
A Vector of String's which contains the output filenames of these files.
- Throws:
WritableDirectoryException - The provided output directory must exist and be writable, or else this exception is thrown. Java will attempt to write a small, temporary file to the directory-name provided. It will be deleted immediately afterwards.
java.io.IOException
- See Also:
StringParse.zeroPad(int), downloadImageGuessType(String, String)
- Code:
- Exact Method Body:
    if (outputDirectory == null) outputDirectory = "";

    // Make sure the directory exists on the file-system, and that it is writable.
    WritableDirectoryException.check(outputDirectory);

    // Unless writing the "current directory" - make sure the directory name ends with the
    // Operating System file-separator character.
    if ((outputDirectory.length() > 0) && (! outputDirectory.endsWith(File.separator)))
        outputDirectory = outputDirectory + File.separator;

    if (rootURL == null) rootURL = "";

    Vector<String> ret = new Vector<String>();
    int count = 0;

    for (String url : urls)
    {
        String fileName = downloadImageGuessType
            (rootURL + url, StringParse.zeroPad(++count), outputDirectory);

        System.out.print(fileName + ((fileName.length() < 10) ? ' ' : '\n'));
        ret.addElement(fileName);
    }

    return ret;
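To make the naming scheme in the loop above concrete, the sketch below pairs each URL with the zero-padded base name it would receive, without downloading anything. It assumes that StringParse.zeroPad(int) produces three-digit names such as "001" (which matches the '001.jpg' examples on this page) and substitutes String.format for it; the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class NumberedNamesSketch {
    /** Maps each generated base name ("001", "002", ...) to the full URL it would be used for. */
    static Map<String, String> plan(String rootURL, Iterable<String> urls) {
        if (rootURL == null) rootURL = "";

        // LinkedHashMap keeps the entries in download order.
        Map<String, String> plan = new LinkedHashMap<>();
        int count = 0;

        for (String url : urls) {
            // Assumption: three-digit zero padding stands in for StringParse.zeroPad(int).
            String baseName = String.format("%03d", ++count);
            plan.put(baseName, rootURL + url);
        }

        return plan;
    }
}
```

The extension is deliberately left off the base name here, because in the method above it is only known after the image has been saved successfully.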
-
getImage
public static void getImage(java.lang.String urlStr, java.lang.String outputFileStr, java.lang.String extensionStr) throws java.io.IOException
This downloads an image to a file named 'outputFileStr'. A valid image-extension needs to be provided for the Java ImageIO.write(...) method to work properly. The 'extensionStr' should be a String such as 'jpg' or 'png' (ImageIO.write expects the informal format name, without a leading dot).
- Parameters:
urlStr - The URL of the image to download.
outputFileStr - The intended file-name root to which the image is supposed to save.
extensionStr - The intended file-name extension with which this image is to be saved.
- Throws:
javax.imageio.IIOException - If this file type / 'extensionStr' is incorrect.
java.io.IOException
- Code:
- Exact Method Body:
    File f = new File(outputFileStr);
    BufferedImage image = ImageIO.read(new URL(urlStr));
    ImageIO.write(image, extensionStr, f);
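The three-line body above ignores two signals that the JDK provides: ImageIO.read returns null when no registered reader can decode the data, and ImageIO.write returns false when no writer matches the format name. The sketch below shows one way a caller might surface both cases as exceptions. It is a hypothetical variant for illustration, with invented class and method names, and not a description of what getImage itself does.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.net.URL;
import javax.imageio.ImageIO;

final class GetImageSketch {
    /** Reads the image at 'imageUrl' and writes it to 'target' using 'formatName' (e.g. "png"). */
    static void fetchAndSave(URL imageUrl, File target, String formatName) throws IOException {
        BufferedImage image = ImageIO.read(imageUrl);

        // A null result means the data at the URL was not a decodable image.
        if (image == null)
            throw new IOException("No registered reader could decode: " + imageUrl);

        // A false result means no writer exists for this image/format combination.
        if (!ImageIO.write(image, formatName, target))
            throw new IOException("No writer available for format: " + formatName);
    }
}
```

Because the method takes a java.net.URL, it can be tried locally with a "file:" URL before pointing it at a web address.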
-
downloadImages
public static java.util.Vector<java.lang.String> downloadImages (java.io.File f) throws java.io.IOException, java.io.FileNotFoundException
This method will read from a text-file, which must contain a list of image-URL's from the internet, and download them, one by one, to the current working directory. Messages will be printed via System.out.print() as each file is downloaded.
- Parameters:
f - A file pointer to a text-file that contains a list of String's. Each String is intended to be a URL to an image on the internet.
- Returns:
A Vector containing the file-names of these images.
- Throws:
java.io.IOException
java.io.FileNotFoundException
- Code:
- Exact Method Body:
    BufferedReader br = new BufferedReader(new FileReader(f));
    Vector<String> pics = new Vector<String>();
    String s;

    while ((s = br.readLine()) != null) pics.addElement(s);

    return downloadImagesGuessTypes(pics);
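The file-reading step above can also be written so that the reader is always closed and blank lines are skipped. The following is a small sketch of that variation using only java.nio; the class and method names are hypothetical, and the UTF-8 charset and the trimming of whitespace are assumptions rather than documented behavior of downloadImages.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class UrlListSketch {
    /** Reads one URL per line from 'listFile', ignoring blank lines and surrounding whitespace. */
    static List<String> readUrls(Path listFile) throws IOException {
        List<String> urls = new ArrayList<>();

        // try-with-resources closes the reader even if readLine() throws.
        try (BufferedReader reader = Files.newBufferedReader(listFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (!line.isEmpty()) urls.add(line);
            }
        }

        return urls;
    }
}
```

The resulting list is an Iterable<String>, so it could be handed directly to downloadImagesGuessTypes(Iterable<String>).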
-
-