WordFinderConfig Class Documentation
class WordFinderConfig
Namespace: com::datalogics::PDFL
Detailed Description
A word finder configuration that customizes the way the extraction is performed. In the default configuration, all options are false.
Constructor & Destructor Documentation
WordFinderConfig
WordFinderConfig()
A word finder configuration that customizes the way the extraction is performed. In the default configuration, all options are false.
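A minimal sketch of how a configuration might be built, assuming the setters documented on this page. The `WordFinderConfig` class below is a local stand-in that only mirrors three of the documented properties so the example compiles without the library; with the real library you would presumably delete it and import the class from the `com.datalogics.PDFL` package instead. `ConfigSketch` and `plainTextConfig` are hypothetical names.

```java
// Local stand-in (assumption) so the sketch compiles on its own.
// It mirrors only three of the documented getter/setter pairs.
final class WordFinderConfig {
    // Java boolean fields start as false, matching "all options are false".
    private boolean noStyleInfo;
    private boolean noExtCharOffset;
    private boolean preserveSpaces;

    boolean getNoStyleInfo() { return noStyleInfo; }
    void setNoStyleInfo(boolean value) { noStyleInfo = value; }

    boolean getNoExtCharOffset() { return noExtCharOffset; }
    void setNoExtCharOffset(boolean value) { noExtCharOffset = value; }

    boolean getPreserveSpaces() { return preserveSpaces; }
    void setPreserveSpaces(boolean value) { preserveSpaces = value; }
}

final class ConfigSketch {
    /** Hypothetical helper: a configuration aimed at plain-text extraction. */
    static WordFinderConfig plainTextConfig() {
        WordFinderConfig config = new WordFinderConfig();
        config.setNoStyleInfo(true);     // no character style information needed
        config.setNoExtCharOffset(true); // word-level offsets are enough
        config.setPreserveSpaces(true);  // keep space characters in the output
        return config;
    }
}
```

Starting from the all-false default and switching on only the options a client needs keeps the intent of each setting visible at the call site.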
Member Function Documentation
DisposeChildren
void DisposeChildren()
Returns:
void
[static initializer]
static void [static initializer]()
delete
synchronized void delete(Boolean disposing)
Parameters
disposing: Boolean
Returns:
synchronized void
delete
synchronized void delete()
Returns:
synchronized void
finalize
void finalize()
Returns:
void
getDisableCharReordering
boolean getDisableCharReordering()
Returns:
boolean
When true, it disables reconstructing the character order, and the word finding algorithm is applied to the characters in drawing order. By default, the word finder reorders characters on a single line by their relative horizontal locations. Most of the time, character reordering improves the text extraction quality. However, on a PDF page with heavily overlapping character bounding boxes, the outcome becomes somewhat unpredictable. In such cases, disabling character reordering (disableCharReordering = true) may produce a more consistent result.
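The difference between drawing order and horizontally reordered characters can be illustrated with a small sketch. This is not the library's algorithm; `Glyph`, `inDrawingOrder` and `reorderedByX` are hypothetical names, and each glyph is reduced to a character plus the left edge of its bounding box.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class CharReorderSketch {
    /** Hypothetical glyph: a character and the left edge of its bounding box. */
    static final class Glyph {
        final char ch;
        final double x;
        Glyph(char ch, double x) { this.ch = ch; this.x = x; }
    }

    /** Concatenates the glyphs in the order given, i.e. drawing order. */
    static String inDrawingOrder(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            sb.append(g.ch);
        }
        return sb.toString();
    }

    /** Sorts a copy of the line by horizontal position, then concatenates. */
    static String reorderedByX(List<Glyph> glyphs) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        // List.sort is stable, so glyphs sharing an x keep their drawing order.
        sorted.sort(Comparator.comparingDouble(g -> g.x));
        return inDrawingOrder(sorted);
    }
}
```

When many glyphs overlap, small differences in `x` decide the output of the sorted variant, which is one way to picture why the reordered result can become unpredictable on such pages.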
getDisableTaggedPDF
boolean getDisableTaggedPDF()
Returns:
boolean
When true, it disables tagged PDF support and treats the document as non-tagged PDF. Use this to keep the word finder in legacy mode when it is created with the latest algorithm version.
getIgnoreCharGaps
boolean getIgnoreCharGaps()
Returns:
boolean
When true, it disables converting large character gaps to space characters, so that the word finder reports a space only when a space character appears in the original PDF content. This option has no effect on tagged PDF.
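As a rough illustration of what converting large character gaps to spaces means, the sketch below inserts a space whenever the horizontal gap between two neighboring glyphs exceeds a threshold. The threshold, the `Glyph` type and the `toText` method are assumptions made for the example; the library's actual gap heuristic is not described here.

```java
import java.util.List;

final class CharGapSketch {
    /** Hypothetical glyph: a character with the left and right edges of its box. */
    static final class Glyph {
        final char ch;
        final double left;
        final double right;
        Glyph(char ch, double left, double right) {
            this.ch = ch;
            this.left = left;
            this.right = right;
        }
    }

    /**
     * Builds the text of one line. Unless ignoreCharGaps is set, a gap wider
     * than gapThreshold between consecutive glyphs is reported as a space.
     */
    static String toText(List<Glyph> line, double gapThreshold, boolean ignoreCharGaps) {
        StringBuilder sb = new StringBuilder();
        Glyph previous = null;
        for (Glyph g : line) {
            boolean largeGap = previous != null && g.left - previous.right > gapThreshold;
            if (largeGap && !ignoreCharGaps) {
                sb.append(' '); // synthesized space, not present in the content
            }
            sb.append(g.ch);
            previous = g;
        }
        return sb.toString();
    }
}
```

With the flag set, only space characters that are actually part of the input would appear in the result.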
getIgnoreLineGaps
boolean getIgnoreLineGaps()
Returns:
boolean
When true, it disables treating vertical movements as line breaks, so that the word finder determines a line break only when a line break character or special tag information appears in the original PDF content. This option has no effect on tagged PDF.
getNoAnnots
boolean getNoAnnots()
Returns:
boolean
When true, it disables extracting text from text annotations. Normally, the word finder extracts text from the normal appearances of text annotations that are inside the page crop box.
getNoEncodingGuess
boolean getNoEncodingGuess()
Returns:
boolean
When true, it disables guessing the encoding of fonts that have an unknown or custom encoding when there is no ToUnicode table. Inappropriate encoding conversions can cause the word finder to mistakenly recognize non-Roman single-byte fonts as Standard Roman encoding fonts and extract the text in an unusable format. When this option is selected, the word finder avoids such unreliable encoding conversions and tries to provide the original characters without any encoding conversion, for a client that has its own encoding handling.
getNoExtCharOffset
boolean getNoExtCharOffset()
Returns:
boolean
When true, it disables generating extended character offset information to improve text extraction performance. The extended character offset information is necessary to determine the exact character offset for character-by-character text selection. The beginning character offset of each word is always available regardless of this option, and can be used for word-by-word text selection with reasonable accuracy. When a client has no need for the detailed character offset information, it can use this option to improve text extraction efficiency. The difference in extraction speed is minor, but less memory is needed for the extracted word list.
getNoHyphenDetection
boolean getNoHyphenDetection()
Returns:
boolean
When true, it disables finding and removing soft hyphens in non-tagged PDF, so that the word finder trusts hard hyphens as non-soft hyphens. This option has no effect on tagged PDF files. Normally, the word finder does not differentiate between soft and hard hyphen characters in non-tagged PDF files, because these are often misused.
getNoLigatureExp
boolean getNoLigatureExp()
Returns:
boolean
When true, it disables the expansion of ligatures using the default ligatures. The default ligatures are:
fi
ff
fl
ffi
ffl
ch
cl
ct
ll
ss
fs
st
oe
OE
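To make the idea of ligature expansion concrete, the sketch below replaces single ligature code points with their component letters. It is only an illustration, not the library's implementation: the table covers just those entries of the list above that have a well-known Unicode code point (U+FB00 to U+FB04, U+FB06, U+0152 and U+0153), and `LigatureSketch` and `expand` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

final class LigatureSketch {
    // Ligature code point -> expanded letters (subset of the documented list).
    private static final Map<Character, String> EXPANSIONS = new HashMap<>();
    static {
        EXPANSIONS.put('\uFB00', "ff");
        EXPANSIONS.put('\uFB01', "fi");
        EXPANSIONS.put('\uFB02', "fl");
        EXPANSIONS.put('\uFB03', "ffi");
        EXPANSIONS.put('\uFB04', "ffl");
        EXPANSIONS.put('\uFB06', "st");
        EXPANSIONS.put('\u0153', "oe");
        EXPANSIONS.put('\u0152', "OE");
    }

    /** Expands known ligatures unless noLigatureExp is set. */
    static String expand(String text, boolean noLigatureExp) {
        if (noLigatureExp) {
            return text; // leave ligature characters untouched
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String expansion = EXPANSIONS.get(c);
            sb.append(expansion != null ? expansion : String.valueOf(c));
        }
        return sb.toString();
    }
}
```

Expanded text is usually easier to search and index, while unexpanded text stays closer to the characters stored in the PDF content.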
getNoSkewedQuads
boolean getNoSkewedQuads()
Returns:
boolean
When true, it disables the creation of a quad per character for skewed words, that is, words with a horizontally aligned but non-rectangular bounding region. Each skewed word is instead associated with a single rectangular bounding region.
getNoStyleInfo
boolean getNoStyleInfo()
Returns:
boolean
When true, it disables generating character style information to improve text extraction performance and memory efficiency. When you select this option, you cannot access the StyleTransition property of Word objects returned from WordFinder.
getNoTextRenderMode3
boolean getNoTextRenderMode3()
Returns:
boolean
When true, it disables extracting text drawn with text rendering mode Tr = 3 ("Neither fill nor stroke text (invisible)"). Normally, the word finder extracts such text like any other text.
getNoXYSort
boolean getNoXYSort()
Returns:
boolean
When true, it disables generating an XY-ordered word list.
getPreciseQuad
boolean getPreciseQuad()
Returns:
boolean
When true, the bounding box or bounding quad is set based on the actual glyph bounding box.
getPreserveRedundantChars
boolean getPreserveRedundantChars()
Returns:
boolean
When true, it disables detecting and removing redundant characters. Some PDF pages have the same text drawn multiple times on the same spot to get a special visual effect. Normally, those redundant characters are removed from the word finder output.
Since this option may leave extra characters with overlapping bounding boxes, using it together with the disableCharReordering option is recommended for more consistent text extraction results.
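One way to picture redundant character removal is a filter that drops a glyph when the same character has already been seen at almost the same position, as happens with shadow or fake-bold effects. The sketch below works under that assumption; the tolerance, the `Glyph` type and the method names are hypothetical, and the library's real detection rule may differ.

```java
import java.util.ArrayList;
import java.util.List;

final class RedundantCharSketch {
    /** Hypothetical glyph: a character and the origin of its bounding box. */
    static final class Glyph {
        final char ch;
        final double x;
        final double y;
        Glyph(char ch, double x, double y) { this.ch = ch; this.x = x; this.y = y; }
    }

    /** Keeps the first glyph of each group drawn on (nearly) the same spot. */
    static List<Glyph> dropRedundant(List<Glyph> glyphs, double tolerance) {
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            boolean duplicate = false;
            for (Glyph k : kept) {
                if (k.ch == g.ch
                        && Math.abs(k.x - g.x) <= tolerance
                        && Math.abs(k.y - g.y) <= tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(g);
            }
        }
        return kept;
    }

    /** Concatenates the characters of the given glyphs in list order. */
    static String text(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            sb.append(g.ch);
        }
        return sb.toString();
    }
}
```

Skipping this filter corresponds to preserving redundant characters: every drawn copy then appears in the output.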
getPreserveSpaces
boolean getPreserveSpaces()
Returns:
boolean
When true, the word finder preserves space characters during word breaking. Otherwise, spaces are removed from the output text. When false (the default), you can add spaces later by checking the WordAttributeFlags.AdjacentToSpace attribute, but there is no way to restore the exact number of consecutive space characters.
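The following sketch shows how spaces might be re-added after extraction when space characters were not preserved. It assumes that the AdjacentToSpace attribute means a space follows the word; the `Word` type here is a hypothetical stand-in whose boolean field represents that attribute test, not the library's Word class.

```java
import java.util.List;

final class SpaceRejoinSketch {
    /** Hypothetical stand-in for an extracted word and its AdjacentToSpace flag. */
    static final class Word {
        final String text;
        final boolean adjacentToSpace;
        Word(String text, boolean adjacentToSpace) {
            this.text = text;
            this.adjacentToSpace = adjacentToSpace;
        }
    }

    /** Joins words, adding exactly one space after each flagged word. */
    static String join(List<Word> words) {
        StringBuilder sb = new StringBuilder();
        for (Word w : words) {
            sb.append(w.text);
            if (w.adjacentToSpace) {
                // Only one space can be restored: the original run length is lost.
                sb.append(' ');
            }
        }
        return sb.toString().trim(); // drop a trailing space after the last word
    }
}
```

If the original run of spaces matters, for example in fixed-width layouts, preserving spaces during extraction is the only option this page describes.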
getTrustNBSpace
boolean getTrustNBSpace()
Returns:
boolean
When true, it disables treating non-breaking space characters as regular space characters in non-tagged PDF files, so that the word finder preserves the space without breaking the word. This option has no effect on tagged PDF files. Normally, the word finder does not differentiate between breaking and non-breaking space characters in non-tagged PDF files, because these are often misused.
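The effect on word breaking can be sketched with plain strings. In this simplified model, which is an assumption and not the library's word-breaking logic, words are split on regular spaces, and a non-breaking space (U+00A0) is first converted to a regular space unless it is trusted. `NbSpaceSketch` and `splitWords` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class NbSpaceSketch {
    /** Splits text into words; U+00A0 breaks a word only when it is not trusted. */
    static List<String> splitWords(String text, boolean trustNBSpace) {
        // Default behavior in this model: treat the non-breaking space as a regular space.
        String normalized = trustNBSpace ? text : text.replace('\u00A0', ' ');
        List<String> words = new ArrayList<>();
        for (String part : normalized.split(" +")) {
            if (!part.isEmpty()) {
                words.add(part);
            }
        }
        return words;
    }
}
```

For input such as "10", a non-breaking space, "km away", trusting the non-breaking space keeps the quantity and its unit together as a single word.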
getUnknownToStdEnc
boolean getUnknownToStdEnc()
Returns:
boolean
When true, it assumes any font with unknown or custom encoding to be Standard Roman. This option overrides the noEncodingGuess option.
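The stated precedence between unknownToStdEnc and noEncodingGuess can be written down as a small decision function. This is only a sketch of the documented rule for a font with unknown or custom encoding and no ToUnicode table; the `Policy` enum and `policyFor` method are hypothetical names, not part of the library.

```java
final class EncodingPolicySketch {
    /** Hypothetical labels for the three behaviors described on this page. */
    enum Policy {
        ASSUME_STANDARD_ROMAN, // unknownToStdEnc is set
        KEEP_ORIGINAL_CODES,   // noEncodingGuess is set, no conversion attempted
        GUESS_ENCODING         // default: the word finder guesses the encoding
    }

    /** Resolves the two options for a font with unknown or custom encoding. */
    static Policy policyFor(boolean unknownToStdEnc, boolean noEncodingGuess) {
        if (unknownToStdEnc) {
            // Documented rule: this option overrides noEncodingGuess.
            return Policy.ASSUME_STANDARD_ROMAN;
        }
        if (noEncodingGuess) {
            return Policy.KEEP_ORIGINAL_CODES;
        }
        return Policy.GUESS_ENCODING;
    }
}
```

Setting both options therefore behaves the same as setting unknownToStdEnc alone.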
setDisableCharReordering
void setDisableCharReordering(boolean value)
Parameters
value: boolean
Returns:
void
When true, it disables reconstructing the character order, and the word finding algorithm is applied to the characters in drawing order. By default, the word finder reorders characters on a single line by their relative horizontal locations. Most of the time, character reordering improves the text extraction quality. However, on a PDF page with heavily overlapping character bounding boxes, the outcome becomes somewhat unpredictable. In such cases, disabling character reordering (disableCharReordering = true) may produce a more consistent result.
setDisableTaggedPDF
void setDisableTaggedPDF(boolean value)
Parameters
value: boolean
Returns:
void
When true, it disables tagged PDF support and treats the document as non-tagged PDF. Use this to keep the word finder in legacy mode when it is created with the latest algorithm version.
setIgnoreCharGaps
void setIgnoreCharGaps(boolean value)
Parameters
value: boolean
Returns:
void
When true, it disables converting large character gaps to space characters, so that the word finder reports a space only when a space character appears in the original PDF content. This option has no effect on tagged PDF.
setIgnoreLineGaps
void setIgnoreLineGaps(boolean value)
Parameters
value: boolean
Returns:
void
When true, it disables treating vertical movements as line breaks, so that the word finder determines a line break only when a line break character or special tag information appears in the original PDF content. This option has no effect on tagged PDF.
setNoAnnots
void setNoAnnots(boolean value)
Parameters
value: boolean
Returns:
void
When true, it disables extracting text from text annotations. Normally, the word finder extracts text from the normal appearances of text annotations that are inside the page crop box.
setNoEncodingGuess
void setNoEncodingGuess(boolean value)
Parameters
value: boolean
Returns:
void
When true, it disables guessing the encoding of fonts that have an unknown or custom encoding when there is no ToUnicode table. Inappropriate encoding conversions can cause the word finder to mistakenly recognize non-Roman single-byte fonts as Standard Roman encoding fonts and extract the text in an unusable format. When this option is selected, the word finder avoids such unreliable encoding conversions and tries to provide the original characters without any encoding conversion, for a client that has its own encoding handling.
setNoExtCharOffset
void setNoExtCharOffset(boolean value)
Parameters
value: boolean
Returns:
void
When true, it disables generating extended character offset information to improve text extraction performance. The extended character offset information is necessary to determine the exact character offset for character-by-character text selection. The beginning character offset of each word is always available regardless of this option, and can be used for word-by-word text selection with reasonable accuracy. When a client has no need for the detailed character offset information, it can use this option to improve text extraction efficiency. The difference in extraction speed is minor, but less memory is needed for the extracted word list.
setNoHyphenDetection
void setNoHyphenDetection(boolean value)
Parameters
value: boolean
Returns:
void
When true, it disables finding and removing soft hyphens in non-tagged PDF, so that the word finder trusts hard hyphens as non-soft hyphens. This option has no effect on tagged PDF files. Normally, the word finder does not differentiate between soft and hard hyphen characters in non-tagged PDF files, because these are often misused.
setNoLigatureExp
void setNoLigatureExp(boolean value)
Parameters
value: boolean
Returns:
void
When true, it disables the expansion of ligatures using the default ligatures. The default ligatures are:
fi
ff
fl
ffi
ffl
ch
cl
ct
ll
ss
fs
st
oe
OE
setNoSkewedQuads
void setNoSkewedQuads(boolean value)
Parameters
value: boolean
Returns:
void
When true, it disables the creation of a quad per character for skewed words, that is, words with a horizontally aligned but non-rectangular bounding region. Each skewed word is instead associated with a single rectangular bounding region.
setNoStyleInfo
void setNoStyleInfo(boolean value)
Parameters
value: boolean
Returns:
void
When true, it disables generating character style information to improve text extraction performance and memory efficiency. When you select this option, you cannot access the StyleTransition property of Word objects returned from WordFinder.
setNoTextRenderMode3
void setNoTextRenderMode3(boolean value)
Parameters
value: boolean
Returns:
void
When true, it disables extracting text drawn with text rendering mode Tr = 3 ("Neither fill nor stroke text (invisible)"). Normally, the word finder extracts such text like any other text.
setNoXYSort
void setNoXYSort(boolean value)
Parameters
value: boolean
Returns:
void
When true, it disables generating an XY-ordered word list.
setPreciseQuad
void setPreciseQuad(boolean value)
Parameters
value: boolean
Returns:
void
When true, the bounding box or bounding quad is set based on the actual glyph bounding box.
setPreserveRedundantChars
void setPreserveRedundantChars(boolean value)
Parameters
value: boolean
Returns:
void
When true, it disables detecting and removing redundant characters. Some PDF pages have the same text drawn multiple times on the same spot to get a special visual effect. Normally, those redundant characters are removed from the word finder output.
Since this option may leave extra characters with overlapping bounding boxes, using it together with the disableCharReordering option is recommended for more consistent text extraction results.
setPreserveSpaces
void setPreserveSpaces(boolean value)
Parameters
value: boolean
Returns:
void
When true, the word finder preserves space characters during word breaking. Otherwise, spaces are removed from the output text. When false (the default), you can add spaces later by checking the WordAttributeFlags.AdjacentToSpace attribute, but there is no way to restore the exact number of consecutive space characters.
setTrustNBSpace
void setTrustNBSpace(boolean value)
Parameters
value: boolean
Returns:
void
When true, it disables treating non-breaking space characters as regular space characters in non-tagged PDF files, so that the word finder preserves the space without breaking the word. This option has no effect on tagged PDF files. Normally, the word finder does not differentiate between breaking and non-breaking space characters in non-tagged PDF files, because these are often misused.
setUnknownToStdEnc
void setUnknownToStdEnc(boolean value)
Parameters
value: boolean
Returns:
void
When true, it assumes any font with unknown or custom encoding to be Standard Roman. This option overrides the noEncodingGuess option.