Redact PDF Scanned Documents in Java

Want to secure secret or sensitive information within your documents? It is doable whether the information is regular text or text inside scanned images embedded in the document. Our earlier articles, which discussed different strategies for searching words and synonyms across multiple documents, may help you refine your search. This article shows how to redact PDF text and text in images within a document using Java.

The following topics are covered below:

  • Java API for Text and Image Redaction
  • Redact PDF Text and Scanned Image Text using Java

Java API for Text and Image Redaction

GroupDocs.Redaction provides a redaction solution for securing classified information. Its Java API allows you to redact or remove confidential information from documents of various file formats within your Java-based applications. Along with simple text redaction and rasterization, the API can also identify text in images embedded in a document, such as the commonly used scanned PDF files. The complete list of supported file formats is available in the documentation.

Download or Configure

You can download the JAR file from the downloads section, or add the latest repository and dependency configurations to the pom.xml of your Maven-based Java applications.

<repository>
	<id>GroupDocsJavaAPI</id>
	<name>GroupDocs Java API</name>
	<url>https://repository.groupdocs.com/repo/</url>
</repository>
<dependency>
	<groupId>com.groupdocs</groupId>
	<artifactId>groupdocs-redaction</artifactId>
	<version>21.6</version>
</dependency>

Redact PDF Text and Scanned Image Text using Java

We have already discussed different ways to find and replace text in documents. However, we can also redact text within images by combining OCR with the redaction process. I will use the following PDF document, which contains some regular text and an image with some text. First, we identify the text in the document, including the text inside the image. Then, we cover it with a black box to programmatically hide any legal, confidential, or secret information, even if it appears as text within a scanned document image.
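The last step of that process, covering located text with a black box, can be pictured with plain Java 2D. The sketch below is not GroupDocs.Redaction code and does not perform OCR. It assumes an OCR engine has already reported the bounding rectangle of a word, and the class name BlackBoxSketch, the method coverRegion, and the coordinates are all hypothetical and chosen only for illustration.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

// Hypothetical helper for illustration; not part of GroupDocs.Redaction.
final class BlackBoxSketch {

    // Paints a filled black rectangle over the given region of the image,
    // overwriting the pixels that held the text.
    static void coverRegion(BufferedImage image, Rectangle region) {
        Graphics2D g = image.createGraphics();
        try {
            g.setColor(Color.BLACK);
            g.fillRect(region.x, region.y, region.width, region.height);
        } finally {
            g.dispose();
        }
    }

    public static void main(String[] args) {
        // A blank white "scanned page" standing in for a real image.
        BufferedImage page = new BufferedImage(200, 100, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = page.createGraphics();
        g.setColor(Color.WHITE);
        g.fillRect(0, 0, 200, 100);
        g.dispose();

        // Assumed OCR result: a word located at x=20, y=30, sized 80x15 pixels.
        coverRegion(page, new Rectangle(20, 30, 80, 15));

        // Print the color of one pixel inside the covered region.
        System.out.println(Integer.toHexString(page.getRGB(25, 35)));
    }
}
```

Because the rectangle overwrites the pixels, the original text cannot be read back from the covered region of the image.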

PDF with text and scanned image

The following steps detect and replace text in a PDF document that contains regular text or text within embedded images.

  • Prepare the redactor settings using an OCR connector.
  • Load your PDF file using the Redactor class, along with any specific loading options if required.
  • Define your replacement options. I am opting to black out the text.
  • Prepare the redactions; use an appropriate redaction strategy such as phrase redaction or RegEx redaction.
  • Apply the redactions using the apply method.
  • Save the redacted document using the save method.
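When a RegEx redaction strategy is chosen, it can help to check what a pattern matches on plain text before using it on a real document. The following is a minimal sketch that uses only java.util.regex and does not call the GroupDocs.Redaction API. The class name RegexRedactionPreview, the method blackOut, the sample sentence, and the two patterns are hypothetical and serve only as an illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of GroupDocs.Redaction.
final class RegexRedactionPreview {

    // Replaces every match of the pattern with a run of block characters
    // of the same length, leaving the rest of the text unchanged.
    static String blackOut(String text, Pattern pattern) {
        Matcher matcher = pattern.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (matcher.find()) {
            out.append(text, last, matcher.start());
            for (int i = matcher.start(); i < matcher.end(); i++) {
                out.append('\u2588'); // full block character
            }
            last = matcher.end();
        }
        out.append(text.substring(last));
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample text and patterns are made up for this preview.
        String sample = "Dear John Doe, your card 1234 expires 09/25.";
        Pattern name = Pattern.compile("(?<=Dear )[^,]+");   // text after "Dear " up to the comma
        Pattern expiry = Pattern.compile("\\d{2}/\\d{2}");   // dates such as 09/25
        System.out.println(blackOut(blackOut(sample, name), expiry));
    }
}
```

Once a pattern covers only the intended text, a similar expression can be tried in the RegEx redaction step. Whether the library's regular expression dialect matches java.util.regex in every detail is an assumption to verify against the documentation.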

The following source code redacts the selected text within a PDF document using Java.

The output of the above code is shown below, with the selected text of the PDF document blacked out.

Redact PDF text and scanned image text

Get a Free API License

You can get a free temporary license to use the API without the evaluation limitations.

Conclusion

To conclude, you have learned how to redact text in documents, including text in the images within a PDF document, using Java. Similarly, you can redact text and images in documents of other supported formats. We used regular expression redaction, although other redaction strategies are also available, and we hid the matched text using a black box.

To learn more about the API, visit the documentation. For queries, contact us via the forum.
