AlfrescoContentExtractor

Alfresco content extractor using CMIS technology

This Alfresco extractor uses CMIS to fetch document content from a given Alfresco repository.

Mandatory settings

Key	Type	Description
Alfresco connection provider	AlfrescoCMISConnectionProvider	CMIS version must be 1.1

Optional settings

Key	Type	Description	Default value
Property Helper	PropertyHelper
Extract document content	Boolean		true

AlfrescoRestContentExtractor

Alfresco content extractor using Alfresco REST protocol

This task relies on the Alfresco public REST API (with v1.0.4 of the Alfresco REST client) to retrieve documents and metadata from a given Alfresco instance.

Mandatory settings

Key	Type	Description
Alfresco connection provider	AlfrescoRESTConnectionProvider

Optional settings

Key	Type	Default value
Extract annotations	Boolean
Date format	String	E MMM dd HH:mm:ss Z YYYY
Extract content	Boolean

AWSContentSource

Extract content from AWS S3 bucket

Mandatory settings

Key	Type	Description
AWS access credentials	AWSConnectionProvider	Credentials of the user (must have been granted AmazonS3FullAccess permission).

Optional settings

Key	Type	Description	Default value
ARN key for getAwsPrefixKMS encryption	String	Ex/ arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab
Bucket name	String	Name of the S3 bucket where the content is stored.	${bucket}
Content path (S3 object key)	String	Path leading to the S3 object corresponding to the content you intend to extract from the bucket. To use this option, you must enable the content extraction option. Ex/ ${contentPath}
Process s3 objects as punnets	Boolean
Extract punnet contents from S3 Objects	Boolean	All existing contents of documents will be replaced by the newly found contents, retrieved from the S3 bucket.
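
The `${bucket}` and `${contentPath}` values above are placeholders resolved for each document. As a rough illustration only (this is not Fast2's own resolver; the `resolve_pattern` helper and the sample metadata are hypothetical), Python's `string.Template` happens to use the same `${name}` syntax:

```python
from string import Template


def resolve_pattern(pattern, metadata):
    """Replace each ${name} placeholder with the matching metadata value.

    Hypothetical helper: it only mimics the ${...} syntax shown in the table.
    Raises KeyError when a placeholder has no matching metadata.
    """
    return Template(pattern).substitute(metadata)


# Made-up document metadata for the illustration.
document = {"bucket": "invoices-archive", "contentPath": "2023/04/invoice-001.pdf"}

bucket_name = resolve_pattern("${bucket}", document)
object_key = resolve_pattern("${contentPath}", document)
print(f"s3://{bucket_name}/{object_key}")
```

With this approach, a missing metadata value fails loudly rather than producing a partially resolved object key.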

CMContentExtractor

Basic content extractor from Content Manager

This class is dedicated to the extraction of content for the Content Manager solution. You can also extract annotations, custom properties and logs.

Mandatory settings

Key	Type	Description
CM connection provider	CMConnectionProvider

Optional settings

Key	Type	Description	Default value
Extract document annotation	Boolean		false
Extract advanced system properties from DKDDO object	Boolean		true
Extract standard system properties	Boolean		true
Extract note logs	Boolean		false
Default page height	Float	This value is used when converting annotations to XFDF. Its type is float.	842.0f
Extract note logs as annotations	Boolean		false
Extract custom properties	Boolean		true
Annotation converter	CMAnnotationConverter
Extract history logs	Boolean		true
Default page width	Float	This value is used when converting annotations to XFDF. Its type is float.	595.0f
Save annotations as XFDF	Boolean	If disabled, annotations will be saved under raw CM format.	true
Extract document content	Boolean		true
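
The default page size of 595.0 x 842.0 coincides with an A4 page expressed in PDF points (1/72 inch). The arithmetic below is a quick check of that correspondence; the `mm_to_points` helper is illustrative and not part of Fast2:

```python
MM_PER_INCH = 25.4
POINTS_PER_INCH = 72  # PDF unit: 1 point = 1/72 inch


def mm_to_points(millimetres):
    """Convert a length in millimetres to PDF points."""
    return millimetres / MM_PER_INCH * POINTS_PER_INCH


# A4 is 210 mm x 297 mm.
width_pt = mm_to_points(210)
height_pt = mm_to_points(297)
print(round(width_pt), round(height_pt))  # prints: 595 842
```

If the source documents use another page size, the same conversion gives the values to enter in the two settings.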

CMODContentExtractor

Basic CMOD content extractor

Mandatory settings

Key	Type	Description
CMOD Connection Settings	CMODConnectionProvider

Optional settings

Key	Type	Description	Default value
Pattern to store resource files	String		${resourceId}
Export attached CMOD resources	Boolean		true

DctmContentExtractor

Extract document-related details from Documentum

This Documentum connector is designed to extract document versions, metadata, folders and content (only the first content of a document) from a Documentum repository. The versions of a multiversion document are retrieved through their shared ‘i_chronicle_id’. Since the Documentum architecture involves specific port and access management, a worker should be started on the same server where Documentum is running.

Make sure to check the basic setup requirements for Documentum in the official Fast2 documentation.

Mandatory settings

There is no mandatory configuration field for this task.

Optional settings

Key	Type	Description	Default value
Connection information to Documentum Repository	DctmConnectionProvider
Extract folders	Boolean		true
Extract renditions	Boolean	Check this option to extract renditions of each document. They will be attached as side-contents in the document, with properties populated from original renditions properties.
Whitelist for metadata to extract	String	All values need to be separated by comma `,`.
Extract metadata	Boolean		true
Continue on fail	Boolean	If `true`, any error which occurs during extraction of either metadata, content or folders will trigger an exception. Otherwise, the error will be found in the logs.
Extract content	Boolean		true
Extract all versions	Boolean
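
The metadata whitelist is a single comma-separated string. A minimal sketch of how such a value could be split and applied to a set of attributes follows; the `parse_whitelist` and `filter_metadata` helpers, the sample values, and the "empty whitelist keeps everything" rule are assumptions made for the illustration, not behaviour taken from Fast2:

```python
def parse_whitelist(raw):
    """Split a comma-separated whitelist, ignoring blanks and stray spaces."""
    return [name.strip() for name in raw.split(",") if name.strip()]


def filter_metadata(metadata, whitelist):
    """Keep only whitelisted keys.

    Assumption: an empty whitelist means 'keep everything'.
    """
    if not whitelist:
        return dict(metadata)
    return {key: value for key, value in metadata.items() if key in whitelist}


# Made-up attribute values for the illustration.
attributes = {
    "object_name": "contract.pdf",
    "r_creation_date": "2021-03-01",
    "i_chronicle_id": "0900000180001234",
}
kept = filter_metadata(attributes, parse_whitelist("object_name, i_chronicle_id"))
print(kept)
```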

FileNet35ContentSource

Extract content from FileNet 3.5

Use this task to retrieve the content of documents to extract from a given FileNet instance. This task needs to be preceded by a FileNet35Source task.

Mandatory settings

Key	Type	Description
FileNet 3.5 connection provider	FileNet35ConnectionProvider	Connection parameters to the FileNet instance

Optional settings

Key	Type	Description	Default value
Ignore documents with zero-sized content	Boolean	Document without any content will not be processed	false

FileNetContentExtractor

Extract document content from FileNet P8

This task is not a real source task. The documents to be extracted are identified by a BlankSource task generating a set of ‘empty’ punnets, i.e. punnets containing only documents, each bearing the document number (documentId) to extract.

Mandatory settings

Key	Type	Description
FileNet connection provider	FileNetConnectionProvider	Connection parameters to the FileNet instance

Optional settings

Key	Type	Description	Default value
Property Helper to use	PropertyHelper
Extract object type properties	Boolean	The FileNet P8 metadata of the document which are Object type will be saved at the punnet level	false
Compound parent data for children references	String	Name of the parent document property under which the children properties will be stored.
Compound children data to record	String	Name of the child property to store in the parent. Consider setting parent data name as well.
Object store name	String	Name of the repository to extract from
Extract FileNet system properties	Boolean	Save the FileNet system properties as document metadata	false
Skip annotation exceptions	Boolean	Extract documents even if related annotations are in exception like null content	false
Default mimetype	String	Default mimetype to set if the one from FileNet is empty
Extract FileNet security	Boolean	The security of the document will be saved at the punnet level	false
SQL fetch query	String	Use this SQL to fetch documents based on your criteria. Ex/ SELECT [Id],[DocumentTitle] FROM Document WHERE [Property] = '${myCriterion}'
Extract folders absolute path	Boolean	The absolute path of the folder inside the FileNet instance will be extracted during the process	false
Extract content	Boolean	The document content will be extracted during the process	true
Extract annotations	Boolean	All annotations owned by the document will be extracted	true
Extract all versions	Boolean	Extract the superseded versions of the documents matching the query

IDMISContentExtractor

ImageServices WAL JNI-bridged Extractor

This task extracts documents from the Panagon Image Services ECM (indexes, and optionally content and annotations), producing one punnet holding one document for each ECM document. However, it is not a real source task: the documents to be extracted are identified by a BlankSource task generating a set of empty punnets, i.e. punnets containing only documents, each bearing the document number (documentId) to extract.
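
The hand-off described above, where a BlankSource produces punnets holding only document numbers and the extractor then fills them, can be pictured with the following sketch. The `Document` and `Punnet` classes, the `blank_punnets` and `extract` functions and the `fake_fetch` callable are all hypothetical stand-ins, not Fast2 classes:

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Document:
    document_id: str
    content: Optional[bytes] = None  # stays empty until extraction


@dataclass
class Punnet:
    documents: List[Document] = field(default_factory=list)


def blank_punnets(document_ids: List[str]) -> List[Punnet]:
    """Build one 'empty' punnet per ID, holding a document with only its ID."""
    return [Punnet([Document(doc_id)]) for doc_id in document_ids]


def extract(punnet: Punnet, fetch: Callable[[str], bytes]) -> Punnet:
    """Fill every document of the punnet using the supplied fetch function."""
    for document in punnet.documents:
        document.content = fetch(document.document_id)
    return punnet


def fake_fetch(document_id: str) -> bytes:
    """Stand-in for the real repository call."""
    return f"content of {document_id}".encode()


punnets = [extract(p, fake_fetch) for p in blank_punnets(["1001", "1002"])]
```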

Mandatory settings

Key	Type	Description
Username	String	Login with scope to access the docbase with proper rights
Password	String	Password of the aforementioned username
Connection domain	String	Domain name of the connection
Connection organization	String	Organization name for the connection

Optional settings

Key	Type	Description	Default value
Annotations in ARender format	Boolean	Convert annotations to ARender format	false
Annotation converter	ParseISAnnotation	Specific converter from IS format. Allows resizing the extracted annotations
Annotations in raw format	Boolean	Save annotation contents in raw format inside the punnet	false
Version of libIDMIS	String	This task is based on the WAL library and on the specific Fast2 library ‘libIDMIS.dll’. This library must be in a directory of the Windows PATH. In the wrapper.conf or hmi-wrapper.conf file, activate the use of this library: wrapper.java.library.path. = ../libIDMIS/w32. For the moment, only 32-bit libraries are configured.	libIDMIS-1.0.15
Test scenarios	Boolean	Empty testing stub instead of libIDMIS	false
Connection terminal	String	Terminal name for the connection
Use opacity for annotations	Boolean		false
Unrecognized annotation file path	String	Path of the alternative annotation XML file for unrecognized annotations. If not specified, the punnet will go into exception
Extract document content	Boolean	The document will be extracted with its content	true
Extract document annotation	Boolean	The associated annotations will be extracted	true

MDOParserExternalContent

Parse FWTF (Fixed Width Text File) with external content to a punnet description

An MDO file is a flat file in which each line corresponds to a document and contains information about that document. The extraction of information from each line is driven by a CSV configuration file, which provides the name of each metadata to insert into the punnet document, as well as its characteristics.

The configuration file consists of the following columns, separated by commas:

Field: name of the metadata to add
Length: length of the metadata. If the value is longer than this length, it will be truncated. If the value is shorter, it will be padded with spaces on the right
Offset: position in MDO file
Mandatory: Y / N
Occurs: number of occurrences allowed for the field. The successive values of the field will then be added to the values of the metadata (respecting the Length parameter for each one)
Type: Type of metadata to add to the punnet document

The MDOParserExternalContent task is used to retrieve external content for each document. To do this, the name of the column defining the content path is specified in the task settings.
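
To make the column definitions above concrete, here is a minimal sketch of a fixed-width line being parsed from such a specification. It is not the Fast2 parser: the sample field names, the presence of a header row in the CSV, the 0-based offsets and the storage of repeated occurrences back to back are all assumptions made for the illustration.

```python
import csv
import io

# Hypothetical specification file content (header row assumed).
SPEC_CSV = """\
Field,Length,Offset,Mandatory,Occurs,Type
docId,8,0,Y,1,String
title,12,8,Y,1,String
tag,4,20,N,2,String
"""


def load_spec(text):
    """Read the CSV specification into a list of column definitions."""
    return [
        {
            "field": row["Field"],
            "length": int(row["Length"]),
            "offset": int(row["Offset"]),  # assumed 0-based
            "mandatory": row["Mandatory"] == "Y",
            "occurs": int(row["Occurs"]),
        }
        for row in csv.DictReader(io.StringIO(text))
    ]


def parse_line(line, spec):
    """Slice one fixed-width line into a {field: [values]} record."""
    record = {}
    for col in spec:
        values = []
        for i in range(col["occurs"]):
            # Assumption: occurrences follow each other, each 'length' wide.
            start = col["offset"] + i * col["length"]
            value = line[start:start + col["length"]].rstrip()  # drop padding
            if value:
                values.append(value)
        if col["mandatory"] and not values:
            raise ValueError(f"missing mandatory field {col['field']}")
        record[col["field"]] = values
    return record


record = parse_line("DOC00001Invoice 2023A1  B2  ", load_spec(SPEC_CSV))
print(record)
```

The Type column is ignored here for brevity; a fuller version would convert each value according to it, for instance using the configured date format.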

Mandatory settings

Key	Type	Description
MDO format specification file path	String	CSV configuration absolute file path containing MDO format specification

Optional settings

Key	Type	Description	Default value
File scanner	FileScanner	Recovers your files
Date format	String	Date format used in MDO file. Must be the same for each line of the document	yyyy-MM-dd
Property name containing path content	String	Name of the field in the configuration file that contains the path to the content. If not filled, the content will not be saved in the punnet
Dataline property name	String	Name of the metadata that will contain the MDO line read. If not specified, the line read will not be saved in the punnet
Create one punnet for each document of FWTF	Boolean	If true then a punnet with one document will be created for each entry in the MDO file. Otherwise, one punnet will be created containing as many documents as there are entries in the MDO file	false
contentLocationAbsolute	Boolean
Last punnet property name	String	Name of the data indicating which punnet is the last one. If null, the data is not added to the punnet. Only applies when several punnets are created

MDOParserInternalContent

FWTF (Fixed Width Text File) parser with internal content

Like the MDOParserExternalContent task, the MDOParserInternalContent source parses each line of the MDO file into a punnet. The difference between these two tasks is that here the content is stored inside the MDO file itself. The start and end of the content are defined by a tag specified in the task settings.

Mandatory settings

Key	Type	Description
MDO format specification file path	String	CSV configuration absolute file path containing MDO format specification

Optional settings

Key	Type	Description	Default value
File scanner	FileScanner	Recovers your files
Date format	String	Date format used in MDO file. Must be the same for each line of the document	yyyy-MM-dd
End tag	String	End tag property name signifying the end of the content
Dataline property name	String	Name of the metadata that will contain the MDO line read. If not specified, the line read will not be saved in the punnet
Create one punnet for each document of FWTF	Boolean	If true then a punnet with one document will be created for each entry in the MDO file. Otherwise, one punnet will be created containing as many documents as there are entries in the MDO file	false
Last punnet property name	String	Name of the data indicating which punnet is the last one. If null, the data is not added to the punnet. Only applies when several punnets are created
Original text content property name	String	Data name containing original text content. If null, data isn’t added in the punnet

SQLContentExtractor

Extract document content from SQL

Extracts CLOB and BLOB object types. Classic types such as VARCHAR are extracted as well.

Mandatory settings

Key	Type	Description
SQL connection provider	SQLQueryGenericCaller
SQL query	Pattern	Select precisely documents you want to extract through a classic SQL query

Optional settings

Key	Type	Description
SQL mapping for content	String/String map	Mapping of SQL properties to document content.
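
The sketch below illustrates the idea of a query whose columns are split between document metadata and document content. It uses Python's built-in `sqlite3` module with an in-memory database; the table, the column names, the `CONTENT_MAPPING` dictionary and the direction of the mapping (SQL column to content name) are hypothetical and only stand in for the settings above:

```python
import sqlite3

# Hypothetical mapping: SQL column -> name of the content in the document.
CONTENT_MAPPING = {"body": "main_content"}

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets us read columns by name
conn.execute("CREATE TABLE docs (id TEXT, title TEXT, body BLOB)")
conn.execute(
    "INSERT INTO docs VALUES (?, ?, ?)",
    ("42", "Contract", b"%PDF-1.4 sample"),
)

documents = []
for row in conn.execute("SELECT id, title, body FROM docs"):
    doc = {"metadata": {}, "contents": {}}
    for column in row.keys():
        if column in CONTENT_MAPPING:
            # Mapped columns (e.g. a BLOB) become document content.
            doc["contents"][CONTENT_MAPPING[column]] = row[column]
        else:
            # Everything else is kept as plain metadata.
            doc["metadata"][column] = row[column]
    documents.append(doc)
conn.close()
```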
