Classification is an approach in which text is systematically identified and then organized according to rules, and taxonomy is the science of such classification. When you are dealing with a large collection of text documents, it is hard to determine the topic of any one document until its content has been classified taxonomically. In this article, you will learn how to programmatically classify documents according to the IAB-2 and Documents taxonomies using C#.
The following topics are covered below:
- .NET API for Taxonomic Classification
- Document Classification with IAB-2 Taxonomy
- Classify Documents with Document Taxonomy
- Classify Password Protected Documents
.NET API for Taxonomic Classification of Documents
GroupDocs.Classification provides a classification solution for different kinds of applications. Its .NET API allows you to classify documents of various file formats according to different taxonomic categories within your .NET applications. We will use GroupDocs.Classification for .NET to classify PDF and Word documents using C#.
You can download the DLLs or MSI installer from the downloads section or install the API in your .NET application via NuGet.
PM> Install-Package GroupDocs.Classification
Classify Documents with IAB-2 Taxonomy using C#
IAB-2 categorizes the document’s content into multiple topics and then classifies it based on the depth level. The following are the steps to identify the taxonomic classification of documents with IAB-2 taxonomy using C#.
- Instantiate the classifier using the Classifier class.
- Define the input document and the folder that contains it.
- Define the taxonomy as IAB2.
- Set the number of best results to return in the response. (Optional)
- Get the taxonomic categories by calling the Classify method with the defined parameters.
- Print the class name and probability of each best result in the response of the Classify method.
The following C# source code shows how to classify documents using IAB-2 taxonomy and get some of the top document classification results.
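A minimal sketch of the steps above is given here. The file name and folder are placeholders, and the namespace `GroupDocs.Classification.DTO`, the named arguments `bestClassesCount` and `taxonomy`, and the `BestResults`, `Name`, and `Probability` members are written from recollection of the GroupDocs.Classification for .NET API rather than taken from this article, so verify them against the API reference for your installed version. The sketch needs the GroupDocs.Classification NuGet package.

```csharp
using System;
using GroupDocs.Classification;      // Classifier
using GroupDocs.Classification.DTO;  // Taxonomy (namespace is an assumption)

class Iab2ClassificationExample
{
    static void Main()
    {
        // Placeholder input: replace with a real file name and its folder.
        string fileName = "sample.pdf";
        string directory = "path/to/folder";

        var classifier = new Classifier();

        // Request the four best IAB-2 classes.
        // The parameter names bestClassesCount and taxonomy are assumptions.
        var response = classifier.Classify(
            fileName,
            directory,
            bestClassesCount: 4,
            taxonomy: Taxonomy.Iab2);

        // Print every returned class together with its probability.
        foreach (var result in response.BestResults)
        {
            Console.WriteLine($"Class: {result.Name}, Probability: {result.Probability}");
        }
    }
}
```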
Class: Technology_&_Computing, Probability: 0.8188434
Class: Video_Gaming, Probability: 0.12686
Class: Hobbies_&_Interests, Probability: 0.03112753
Class: Music_and_Audio, Probability: 0.006756512
Classify Documents with Document Taxonomy using C#
The Documents taxonomy is used to identify different document classes, such as invoices, CVs, forms, and emails. The following are the steps to identify the taxonomic classification of documents with the Documents taxonomy using C#.
- Instantiate the classifier using the Classifier class.
- Set the input document and the folder that contains it.
- Define the taxonomy as Documents.
- Set the number of top results to return in the response. (Optional)
- Get the taxonomic groups by calling the Classify method with the parameters defined above.
- Print the class name and probability of each best result in the response of the Classify method.
The following C# source code shows how to classify documents and get some of the best taxonomic categories using document taxonomy.
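The steps above can be sketched as follows. As before, this is an illustration under assumptions: the input path is a placeholder, and the `Taxonomy.Documents` value, the named arguments, and the response members are recalled from the GroupDocs.Classification for .NET API and should be checked against its reference documentation.

```csharp
using System;
using GroupDocs.Classification;      // Classifier
using GroupDocs.Classification.DTO;  // Taxonomy (namespace is an assumption)

class DocumentsTaxonomyExample
{
    static void Main()
    {
        // Placeholder input: replace with a real file name and its folder.
        string fileName = "sample.pdf";
        string directory = "path/to/folder";

        var classifier = new Classifier();

        // Classify by document type (invoice, resume, memo, ...) instead of by topic.
        // The parameter names bestClassesCount and taxonomy are assumptions.
        var response = classifier.Classify(
            fileName,
            directory,
            bestClassesCount: 4,
            taxonomy: Taxonomy.Documents);

        // Print every returned document class together with its probability.
        foreach (var result in response.BestResults)
        {
            Console.WriteLine($"Class: {result.Name}, Probability: {result.Probability}");
        }
    }
}
```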
Class: ADVE, Probability: 0.3874436
Class: Resume, Probability: 0.2438204
Class: News, Probability: 0.1357582
Class: Memo, Probability: 0.0641943
Classify Password Protected Documents using C#
If your document is secured with a password, you can provide the password while classifying. The following are the steps for the classification of password-protected documents using C#:
- Instantiate the Classifier.
- Define the input document, its folder, and the password of the protected document.
- Optionally define the taxonomy, for example Documents. If it is omitted, the default taxonomy (IAB-2) is used.
- Get the taxonomic group by calling the Classify method with the defined parameters.
- Get the best class name and its probability from the response of the Classify method.
The following code snippet shows how to classify password-protected documents and get the best taxonomic category using the default taxonomy (IAB-2).
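A hedged sketch of this case follows. The file name, folder, and password are placeholders. The `password` named argument and the `BestClassName` and `BestClassProbability` members are assumptions based on recollection of the GroupDocs.Classification for .NET API, so confirm them in the API reference before relying on them. No taxonomy or result count is passed, so the defaults apply.

```csharp
using System;
using GroupDocs.Classification;  // Classifier

class PasswordProtectedExample
{
    static void Main()
    {
        // Placeholder input: replace with a real protected file, its folder, and its password.
        string fileName = "protected.docx";
        string directory = "path/to/folder";
        string password = "your-password";

        var classifier = new Classifier();

        // Only the password is supplied; taxonomy and result count are left at their defaults.
        // The parameter name password is an assumption.
        var response = classifier.Classify(fileName, directory, password: password);

        // With the default count of one, only the single best class is reported.
        Console.WriteLine(
            $"Best Class: {response.BestClassName}, Probability: {response.BestClassProbability}");
    }
}
```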
Best Class: Hobbies_&_Interests, Probability: 0.4548415
The default taxonomy is IAB-2, and the default count of best results is 1.
Get a Free License
You can get a free temporary license in order to use the API without the evaluation limitations.
Conclusion
To conclude, we learned to classify various kinds of documents using different taxonomies. More precisely, we classified PDF documents as per IAB-2 and document taxonomies using C#. Further, we discussed how we can classify password-protected Word documents with default or specific taxonomic classification. Now you can integrate the document classification feature within your .NET application.
For more about the API, visit the documentation. For queries, contact us via the forum.