The document classification and clustering product helps customers organize their documents into topical sets and assign discovered labels to each document.

The package is a complete solution for:

  • Grouping documents into sets based on semantic similarity
  • Suggesting topics for each of the groups
  • Assigning new documents to one of the groups
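
The three tasks above can be illustrated with a minimal sketch. It is not the product's method: it assumes a plain bag-of-words representation, cosine similarity, and a greedy single-pass grouping as stand-ins, and every function name, the stopword list, and the `threshold` value are hypothetical.

```python
import math
import re
from collections import Counter

# Illustrative stopword list (an assumption, not part of any product).
STOPWORDS = frozenset({"the", "a", "of", "to", "and", "is", "for", "on", "in"})

def vectorize(text):
    """Turn text into a bag-of-words term-count vector (a stand-in feature set)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.3):
    """Greedy grouping: join the most similar group, or start a new one."""
    groups = []  # each group: {"centroid": Counter, "members": [doc index, ...]}
    for i, doc in enumerate(docs):
        vec = vectorize(doc)
        best = max(groups, key=lambda g: cosine(vec, g["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["members"].append(i)
            best["centroid"].update(vec)  # summed counts act as an unnormalized centroid
        else:
            groups.append({"centroid": Counter(vec), "members": [i]})
    return groups

def suggest_topic(group, k=2):
    """Suggest a topic as the k most frequent non-stopword terms in the group."""
    terms = [t for t, _ in group["centroid"].most_common() if t not in STOPWORDS]
    return terms[:k]

def assign(doc, groups):
    """Return the index of the existing group most similar to a new document."""
    vec = vectorize(doc)
    return max(range(len(groups)), key=lambda j: cosine(vec, groups[j]["centroid"]))

docs = [
    "invoice payment overdue invoice",
    "payment invoice reminder",
    "server outage network down",
    "network outage server restart",
]
groups = cluster(docs)
topics = [suggest_topic(g) for g in groups]
new_group = assign("network outage reported", groups)
```

In this sketch the result of grouping depends on document order and on the chosen threshold; a production system would typically use a more robust clustering algorithm and richer features.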

[Figure: Diagram of document classification and clustering]

While most classification and clustering products rely on bag-of-words and co-occurrence statistics, the Lymba document classification and clustering product instead uses the rich output of the K-Platform to categorize and label documents. One or more of the K-Platform products can be configured for use during classification or clustering.

Feature extraction for document classification and clustering is also configurable. You can:

  • Adjust pre-determined parameters and use the built-in feature extraction
  • Select the machine learning type and parameters
  • Write a third-party feature extractor to customize your feature set using external sources and/or the Natural Language Processing markup from one or more of the K-Platform products
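
As a rough illustration of the last option, a custom feature extractor can be thought of as a function from document text to named feature weights. The sketch below is an assumption about shape only: the `FeatureExtractor` signature, the helper names, and the gazetteer lookup are hypothetical and do not reflect the product's actual plug-in interface.

```python
import re
from typing import Callable, Dict, List

# Hypothetical extractor signature: document text in, named feature weights out.
FeatureExtractor = Callable[[str], Dict[str, float]]

def tokens(text: str) -> List[str]:
    """Lowercase alphabetic tokens; a stand-in for real NLP markup."""
    return re.findall(r"[a-z]+", text.lower())

def token_features(text: str) -> Dict[str, float]:
    """Built-in style features: one count per surface token."""
    feats: Dict[str, float] = {}
    for tok in tokens(text):
        key = "tok=" + tok
        feats[key] = feats.get(key, 0.0) + 1.0
    return feats

def make_gazetteer_features(gazetteer: Dict[str, str]) -> FeatureExtractor:
    """Build an extractor from an external term -> category lookup."""
    def extract(text: str) -> Dict[str, float]:
        feats: Dict[str, float] = {}
        for tok in tokens(text):
            if tok in gazetteer:
                key = "cat=" + gazetteer[tok]
                feats[key] = feats.get(key, 0.0) + 1.0
        return feats
    return extract

def combine(extractors: List[FeatureExtractor]) -> FeatureExtractor:
    """Merge several extractors into one, summing weights of shared feature names."""
    def extract(text: str) -> Dict[str, float]:
        merged: Dict[str, float] = {}
        for ex in extractors:
            for name, value in ex(text).items():
                merged[name] = merged.get(name, 0.0) + value
        return merged
    return extract

# Example: token counts plus categories from an illustrative external word list.
extractor = combine([
    token_features,
    make_gazetteer_features({"visa": "payment", "refund": "payment"}),
])
features = extractor("Visa refund refund")
```

Prefixing feature names (`tok=`, `cat=`) keeps features from different sources from colliding when they are merged into one feature set.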

The clustering step is optional. If the document set is already labeled, or has a meaningful structure on disk such as email folders, then the existing labels (or the labels implied by that structure) are used instead of discovered ones.
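
One way to picture labels taken from structure on disk is to treat each file's parent folder name as its label. This is a minimal sketch under that assumption; the function name and the rule that files sitting directly in the root folder stay unlabeled are hypothetical choices, not documented product behavior.

```python
from pathlib import Path

def labels_from_folders(root):
    """Map each file under root to the name of its immediate parent folder.

    Files that sit directly in root have no folder-derived label and are skipped.
    """
    root = Path(root)
    labels = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.parent != root:
            labels[path.relative_to(root).as_posix()] = path.parent.name
    return labels
```

If the returned mapping is empty, no usable structure was found, and the clustering step would be needed to discover groups.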

Document classification and clustering has wide applicability and is often the first step in any Natural Language Processing application that needs to organize and label documents quickly. See PowerAgent™ for an application of this product tailored for Customer Relationship Management.