Extract Images from PDF, Word, Excel, PPT using Java

If you have a document and want to reuse its images in other documents, here is one solution. In this article, we will learn how to programmatically extract images from PDF, Excel, PowerPoint, and Word documents using Java.

Image Extraction Java API
Image Extraction from PDF documents in Java
Extract Images from Word, Excel, PowerPoint documents in Java
Extract Image from Specific Page in Java

Image Extraction Java API

Parse Documents and Extract Data in Java

For the extraction of images, we will use GroupDocs.Parser for Java. This Java API parses documents and extracts images, text, and metadata from word-processing documents, spreadsheets, presentations, PDF files, archives, and email documents. The following document formats are supported by the Java API for image extraction.

Word Processing Documents	DOC, DOCX, DOCM, DOT, DOTX, DOTM, ODT, OTT, RTF
Spreadsheets	XLS, XLSX, XLSM, XLSB, XLT, XLTX, XLTM, ODS, OTS, XLA, XLAM, NUMBERS
Presentations	PPT, PPTX, PPTM, PPS, PPSX, PPSM, POT, POTX, POTM, ODP, OTP
Portable Documents	PDF
Emails	EML, EMLX, MSG
Archives	ZIP
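Since the list above is fixed, it can be handy to check a file's extension before handing it to the parser. The following is a minimal sketch in plain Java; the class and method names are hypothetical and not part of GroupDocs.Parser, and the extension set is simply copied from the table above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of GroupDocs.Parser.
class SupportedFormats {
    // Extensions copied from the supported-formats table above.
    private static final Set<String> EXTENSIONS = new HashSet<>(Arrays.asList(
        "doc", "docx", "docm", "dot", "dotx", "dotm", "odt", "ott", "rtf",
        "xls", "xlsx", "xlsm", "xlsb", "xlt", "xltx", "xltm", "ods", "ots",
        "xla", "xlam", "numbers",
        "ppt", "pptx", "pptm", "pps", "ppsx", "ppsm", "pot", "potx", "potm",
        "odp", "otp",
        "pdf",
        "eml", "emlx", "msg",
        "zip"));

    // Returns true when the file name ends with one of the listed extensions.
    static boolean isSupported(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return false; // no extension at all
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return EXTENSIONS.contains(extension);
    }

    public static void main(String[] args) {
        System.out.println(isSupported("report.PDF"));
        System.out.println(isSupported("notes.txt"));
    }
}
```

The check is case-insensitive and looks only at the file name, so it says nothing about whether the file content is actually valid.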

Before you start with the examples below, I recommend setting up the environment by downloading the latest version of the document parsing Java API from the downloads section. Alternatively, add the following configuration to your Maven-based Java application:

<repository>
	<id>GroupDocsJavaAPI</id>
	<name>GroupDocs Java API</name>
	<url>http://repository.groupdocs.com/repo/</url>
</repository>

<dependency>
	<groupId>com.groupdocs</groupId>
	<artifactId>groupdocs-parser</artifactId>
	<version>20.8</version> 
</dependency>

Extract Images from PDF Documents in Java

Follow these simple steps to get all images from the PDF document.

Instantiate Parser class object.
Call getImages method of Parser class to get all the images.
Iterate over images using PageImageArea.
Save images using the save method of PageImageArea.

That’s it. Extracted images can be saved in BMP, GIF, JPEG, PNG, and WebP formats.
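The four steps above can be sketched roughly as follows. This is an illustration rather than the article's original listing: the package names, the ImageOptions and ImageFormat classes, and the null return for unsupported documents are assumptions based on the GroupDocs.Parser for Java API as I understand it, so check them against the version you install. The input path and the outputName helper are placeholders.

```java
import com.groupdocs.parser.Parser;
import com.groupdocs.parser.data.PageImageArea;
import com.groupdocs.parser.options.ImageFormat;
import com.groupdocs.parser.options.ImageOptions;

class ExtractPdfImages {
    // Hypothetical helper: builds "image-0.png", "image-1.png", ...
    static String outputName(int index) {
        return "image-" + index + ".png";
    }

    public static void main(String[] args) throws Exception {
        // Step 1: instantiate the Parser ("path/document.pdf" is a placeholder).
        try (Parser parser = new Parser("path/document.pdf")) {
            // Step 2: get all images from the document.
            Iterable<PageImageArea> images = parser.getImages();
            if (images == null) {
                // Assumption: null means image extraction is not supported.
                System.out.println("Image extraction is not supported for this document.");
                return;
            }

            // Assumption: ImageOptions selects the output format (PNG here).
            ImageOptions options = new ImageOptions(ImageFormat.Png);

            // Steps 3 and 4: iterate over the images and save each one.
            int index = 0;
            for (PageImageArea image : images) {
                image.save(outputName(index), options);
                index++;
            }
        }
    }
}
```

The try-with-resources block closes the parser even if saving an image fails.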

These are the images retrieved from the PDF document using the above code.

Extracted Images from Document using Java

Extract Images from Word, Excel, PowerPoint Files in Java

Similarly, images can be extracted from word-processing files, spreadsheets, and presentations with the same code. What do you have to change? Just the source document path and its file extension.

Parser parser = new Parser("path/document.docx"); // Word Document
// Parser parser = new Parser("path/document.xlsx"); // Excel Spreadsheet
// Parser parser = new Parser("path/document.pptx"); // PowerPoint Presentation
// Parser parser = new Parser("path/document.pdf"); // PDF Document
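To show that only the path changes, the extraction can be wrapped in one method and called for several documents. This is a sketch under assumptions: the method names, output folder naming, and sample paths are hypothetical, and the GroupDocs.Parser package names and the save overload taking ImageOptions should be verified against your installed version.

```java
import com.groupdocs.parser.Parser;
import com.groupdocs.parser.data.PageImageArea;
import com.groupdocs.parser.options.ImageFormat;
import com.groupdocs.parser.options.ImageOptions;

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class ExtractFromAnyDocument {
    // Hypothetical helper: "path/document.docx" -> "document.docx-images"
    static String outputFolderFor(String sourcePath) {
        return Paths.get(sourcePath).getFileName() + "-images";
    }

    // Extracts every image of one document as PNG; returns how many were saved.
    static int extractImages(String sourcePath) throws Exception {
        int count = 0;
        try (Parser parser = new Parser(sourcePath)) {
            Iterable<PageImageArea> images = parser.getImages();
            if (images == null) {
                return 0; // assumption: null means extraction is not supported
            }
            Path outDir = Files.createDirectories(Paths.get(outputFolderFor(sourcePath)));
            ImageOptions options = new ImageOptions(ImageFormat.Png);
            for (PageImageArea image : images) {
                image.save(outDir.resolve(count + ".png").toString(), options);
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder paths; the same method handles each format.
        String[] sources = {
            "path/document.docx", "path/document.xlsx", "path/document.pptx"
        };
        for (String source : sources) {
            System.out.println(source + ": " + extractImages(source) + " image(s)");
        }
    }
}
```

Writing each document's images into its own folder keeps file names from colliding when several documents are processed in one run.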

Image Extraction from Specific Document Page in Java

If you do not want to extract images from the whole document, you can limit the extraction to a specific page of the document in Java.
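A page-level extraction might look like the sketch below. It assumes that Parser offers a getImages overload taking a zero-based page index and that getDocumentInfo reports the page count; both are assumptions about the GroupDocs.Parser for Java API to confirm in its reference. The path, page index, and isValidPage helper are placeholders.

```java
import com.groupdocs.parser.Parser;
import com.groupdocs.parser.data.PageImageArea;
import com.groupdocs.parser.options.IDocumentInfo;
import com.groupdocs.parser.options.ImageFormat;
import com.groupdocs.parser.options.ImageOptions;

class ExtractPageImages {
    // Hypothetical helper: checks a zero-based page index against the page count.
    static boolean isValidPage(int pageIndex, int pageCount) {
        return pageIndex >= 0 && pageIndex < pageCount;
    }

    public static void main(String[] args) throws Exception {
        int pageIndex = 1; // assumption: zero-based, so this is the second page

        try (Parser parser = new Parser("path/document.pdf")) {
            // Assumption: getDocumentInfo exposes the number of pages.
            IDocumentInfo info = parser.getDocumentInfo();
            if (info == null || !isValidPage(pageIndex, info.getPageCount())) {
                System.out.println("The document has no page with index " + pageIndex);
                return;
            }

            // Assumption: this overload returns only the images of one page.
            Iterable<PageImageArea> images = parser.getImages(pageIndex);
            if (images == null) {
                System.out.println("Image extraction is not supported for this document.");
                return;
            }

            ImageOptions options = new ImageOptions(ImageFormat.Png);
            int index = 0;
            for (PageImageArea image : images) {
                image.save("page-" + pageIndex + "-image-" + index + ".png", options);
                index++;
            }
        }
    }
}
```

Validating the index first avoids asking the parser for a page that does not exist.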

Conclusion

Today, we learned how to extract images from a whole document, or from a specific page, of word-processing documents, spreadsheets, presentations, and PDF files in Java. The code does not change from one file format to another; we only have to pass the right file path and name. That’s it.
