Enable Optical Character Recognition

Optical Character Recognition (OCR) is a technology that converts different types of files, such as stand-alone images, PDF files, and Microsoft Office documents with integrated images, into discoverable data. By default, this option is disabled to avoid degrading performance.

Out of the box, DDC Collector processes JPEG, PNG, TIFF, and bitmap (BMP) images. For the full list of supported content types, refer to the Supported Content Types section. To enable OCR, configure the product as follows:

To... Do...

Recognize stand-alone images

To enable OCR for image files with a specific extension:

  1. In the DDC Collector console, navigate to Sources → File.
  2. Select Files Included on the left.
  3. Click Add Inclusion in the right pane and add the desired extension.

Recognize documents with integrated images

  1. In the DDC Collector console, navigate to Config → Settings → Core → Collector.
  2. Select the Process Document Images option.

The settings are applied an hour after configuration. To start processing images and documents sooner, open the Services snap-in and restart the following services:

  • conceptIndexer
  • ConceptCollector
  • conceptClassifier

NOTE: Make sure that DDC Collector is not processing any files when you restart the services; otherwise, the restart may cause the data classification process to fail.
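
As an alternative to the Services snap-in, the restart could be scripted. The following is a minimal sketch, not a supported procedure. It assumes the registered Windows service names match the names listed above (they may differ from the display names, so check them first), that restarting the services in the listed order is acceptable, and that the script runs from an elevated prompt. The helper names build_restart_commands and restart_services are hypothetical and are not part of the product.

```python
import subprocess

# Service names as listed above. ASSUMPTION: the registered Windows service
# names match these; verify them in the Services snap-in before use.
DDC_SERVICES = ["conceptIndexer", "ConceptCollector", "conceptClassifier"]


def build_restart_commands(services):
    """Return a 'net stop' / 'net start' command pair for each service."""
    commands = []
    for name in services:
        commands.append(["net", "stop", name])
        commands.append(["net", "start", name])
    return commands


def restart_services(services, dry_run=True):
    """Restart each service in turn; with dry_run=True only print the commands."""
    for command in build_restart_commands(services):
        if dry_run:
            print(" ".join(command))
        else:
            # check=True raises CalledProcessError if a command fails,
            # so the script stops at the first service that cannot be restarted.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # Dry run by default: review the printed commands before running for real.
    restart_services(DDC_SERVICES, dry_run=True)
```

Running the script as-is only prints the commands. Pass dry_run=False once you have confirmed the service names and verified that DDC Collector is not processing any files.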
