In the previous post, we discussed how to extract images from documents in Java. Today, we will achieve the same objective using C#. Don't worry if you missed the last post; this article stands on its own. We will learn to programmatically extract images from PDF, Excel, PowerPoint, and Word documents in a C# application using a document parsing .NET API.
The following topics will be covered here:
- Image, Text, and Metadata Extraction .NET API
- Image Extraction from PDF documents
- Extract Images from Word, Excel, PowerPoint documents
- Extract Image from Specific Page
- Supported Formats for Image Extraction
Image, Text, and Metadata Extraction .NET API
GroupDocs.Parser for .NET is a document parsing and data extraction API for .NET. It supports parsing and extraction of images, text, and metadata from word-processing documents, spreadsheets, presentations, archives, and email documents. The document formats that the API supports for image extraction are listed at the end of the article.
We will use this API throughout the article, so I recommend downloading its binaries or installing it from NuGet to prepare your environment.
Extract Images from PDF Documents in C#
You can easily retrieve all the images from any PDF document by following these simple steps.
- Instantiate the Parser class object with the source document.
- Call the GetImages method of the Parser class to get the collection of all the images as PageImageArea objects.
- Iterate over the collection to get every image.
- Save each image to disk using the Save method of PageImageArea.
Extracted images can be saved in BMP, GIF, JPEG, PNG, and WebP formats.
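The steps above can be sketched roughly as follows. This is a minimal example rather than a definitive implementation: the input path, the output file names, and the choice of PNG as the output format are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;
using GroupDocs.Parser.Options;

class ExtractPdfImages
{
    static void Main()
    {
        // Hypothetical input path; replace it with a real PDF on disk.
        using (Parser parser = new Parser("path/document.pdf"))
        {
            // Get all images in the document as PageImageArea objects.
            IEnumerable<PageImageArea> images = parser.GetImages();

            // A null result means image extraction isn't supported for this document.
            if (images == null)
            {
                Console.WriteLine("Image extraction isn't supported for this document.");
                return;
            }

            // Save every image as PNG (assumed here; other ImageFormat values can be used).
            ImageOptions options = new ImageOptions(ImageFormat.Png);
            int imageNumber = 0;
            foreach (PageImageArea image in images)
            {
                // Hypothetical output naming scheme: image-0.png, image-1.png, ...
                image.Save("image-" + imageNumber + ".png", options);
                imageNumber++;
            }
        }
    }
}
```

To save in another format, pass a different ImageFormat value to ImageOptions and adjust the file extension to match.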
Extract Images from Word, Excel, PowerPoint Files in C#
This is not restricted to the PDF format. The same code extracts all the images from word-processing documents, spreadsheets, and presentations. Just change the source document path to a file with the appropriate extension, and the document will be parsed and all its images saved to disk.
using (Parser parser = new Parser("path/document.docx")) // Word Document
// using (Parser parser = new Parser("path/document.xlsx")) // Excel Spreadsheet
// using (Parser parser = new Parser("path/document.pptx")) // Presentation
// using (Parser parser = new Parser("path/document.pdf")) // PDF Document
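To illustrate that only the path changes, here is a small sketch that runs the same extraction over several document types in one pass. The file paths and the output naming scheme are hypothetical; the extension is included in each output name only so that images from different documents do not overwrite each other.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;
using GroupDocs.Parser.Options;

class ExtractFromManyFormats
{
    static void Main()
    {
        // Hypothetical input files; replace them with real documents on disk.
        string[] documents =
        {
            "path/document.docx",
            "path/document.xlsx",
            "path/document.pptx",
            "path/document.pdf"
        };

        ImageOptions options = new ImageOptions(ImageFormat.Png);

        foreach (string path in documents)
        {
            using (Parser parser = new Parser(path))
            {
                IEnumerable<PageImageArea> images = parser.GetImages();
                if (images == null)
                {
                    Console.WriteLine("Image extraction isn't supported for " + path);
                    continue;
                }

                // Build a prefix such as "document-docx" from the source file name.
                string prefix = Path.GetFileNameWithoutExtension(path)
                    + "-" + Path.GetExtension(path).TrimStart('.');

                int imageNumber = 0;
                foreach (PageImageArea image in images)
                {
                    image.Save(prefix + "-" + imageNumber + ".png", options);
                    imageNumber++;
                }
            }
        }
    }
}
```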
Image Extraction from Specific Document Page in C#
If you want to extract images from a specific page of a document, follow these steps.
- Get the information about the document using the GetDocumentInfo method.
- Read the total PageCount from the document information to confirm that the target page exists.
- Call the GetImages(pageIndex) method, passing your target page index to it.
- To save the retrieved images, traverse the returned collection and save each image using the Save method.
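A sketch of these steps is shown here. The input path, the target page index, and the output file names are assumptions for illustration; the page index is treated as zero-based, which is worth verifying against the API reference for the version you use.

```csharp
using System;
using System.Collections.Generic;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;
using GroupDocs.Parser.Options;

class ExtractImagesFromPage
{
    static void Main()
    {
        // Hypothetical target page; assumed zero-based, so 1 means the second page.
        int pageIndex = 1;

        // Hypothetical input path; replace it with a real document on disk.
        using (Parser parser = new Parser("path/document.pdf"))
        {
            // Read the page count so an out-of-range index can be rejected early.
            IDocumentInfo documentInfo = parser.GetDocumentInfo();
            if (pageIndex < 0 || pageIndex >= documentInfo.PageCount)
            {
                Console.WriteLine("The document has only " + documentInfo.PageCount + " page(s).");
                return;
            }

            // Get the images that are placed on the target page only.
            IEnumerable<PageImageArea> images = parser.GetImages(pageIndex);
            if (images == null)
            {
                Console.WriteLine("Image extraction isn't supported for this document.");
                return;
            }

            ImageOptions options = new ImageOptions(ImageFormat.Png);
            int imageNumber = 0;
            foreach (PageImageArea image in images)
            {
                // Hypothetical output naming scheme: page-1-image-0.png, ...
                image.Save("page-" + pageIndex + "-image-" + imageNumber + ".png", options);
                imageNumber++;
            }
        }
    }
}
```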
Supported Formats for Image Extraction in C#
The following document formats are supported by GroupDocs.Parser for .NET for image extraction.
Word Processing Documents | DOC, DOCX, DOCM, DOT, DOTX, DOTM, ODT, OTT, RTF |
Spreadsheets | XLS, XLSX, XLSM, XLSB, XLT, XLTX, XLTM, ODS, OTS, XLA, XLAM, NUMBERS |
Presentations | PPT, PPTX, PPTM, PPS, PPSX, PPSM, POT, POTX, POTM, ODP, OTP |
Portable Documents | PDF |
Emails | EML, EMLX, MSG |
Archives | ZIP |
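If you are unsure whether a particular file supports image extraction, you can ask the API at run time instead of relying on the extension alone. The sketch below assumes the Features property of the Parser class exposes an Images flag, as described in the product documentation; the command-line argument handling is hypothetical.

```csharp
using System;
using GroupDocs.Parser;

class CheckImageSupport
{
    static void Main(string[] args)
    {
        // Hypothetical usage: pass the document path as the first argument.
        if (args.Length == 0)
        {
            Console.WriteLine("Usage: CheckImageSupport <document path>");
            return;
        }

        using (Parser parser = new Parser(args[0]))
        {
            // Features.Images reports whether images can be extracted from this document.
            if (parser.Features.Images)
            {
                Console.WriteLine("Image extraction is supported for " + args[0]);
            }
            else
            {
                Console.WriteLine("Image extraction isn't supported for " + args[0]);
            }
        }
    }
}
```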
More about GroupDocs.Parser
- Documentation
- Source Code Examples
- API Reference
- Family (On-Premise APIs| Cloud APIs | Free Online App)
Let’s talk some more @ Free Support Forum