Introduction
PreParsers convert documents, as they are introduced to the system, into the form in which they can most easily be processed.
Each one typically does only one thing, so PreParsers do not have extensive configuration sections.
Example
Example preParser configurations:
<subConfig type="preParser" id="SgmlPreParser">
  <objectType>extractor.SgmlPreParser</objectType>
  <options>
    <setting type="emptyElements">lb ptr extptr hr</setting>
  </options>
</subConfig>

<subConfig type="preParser" id="CharacterEntityPreParser">
  <objectType>extractor.CharacterEntityPreParser</objectType>
</subConfig>
Explanation
There is not much to explain, as these objects only do one thing and have few options or paths to set.
The first example is one of the few that takes a setting: a list of empty SGML elements to be converted to empty XML elements (e.g. <hr> -> <hr/>).
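The empty-element conversion described above can be sketched in standalone Python. This is only an illustration of the idea, not the SgmlPreParser implementation: the function name close_empty_elements and its regular expression are assumptions made for this example, and attribute values containing '>' are not handled.

```python
import re


def close_empty_elements(text, empty_elements):
    """Rewrite <name ...> as <name .../> for each listed element name.

    `empty_elements` is a whitespace-separated string, in the same shape
    as the 'emptyElements' setting (e.g. "lb ptr extptr hr").
    Hypothetical helper for illustration only.
    """
    names = "|".join(re.escape(name) for name in empty_elements.split())
    # Match an opening tag for one of the names, with optional attributes,
    # unless it already ends in "/>".
    pattern = re.compile(
        r"<(" + names + r")(\s[^<>]*?)?\s*(?<!/)>", re.IGNORECASE
    )
    return pattern.sub(
        lambda m: "<" + m.group(1) + (m.group(2) or "") + "/>", text
    )


if __name__ == "__main__":
    print(close_empty_elements("<p>a<hr>b<lb></p>", "lb ptr extptr hr"))
```

Tags that are already written as empty XML elements, such as <hr/>, are left unchanged by this sketch.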
Some of the currently available PreParsers:
- SgmlPreParser: Convert SGML into XML (lowercase element names, quote all attributes, fix empty tags).
- PrintableOnlyPreParser: Remove any non-printable characters.
- PdfToTxtPreParser: Convert PDF into raw text format using the pdftotext utility.
- TxtToXmlPreParser: Wrap raw text in some simple XML tags.
- HtmlSmashPreParser: Attempts to reduce HTML to its raw text.
- RegexpSmashPreParser: Strip, replace or keep data which matches a given regular expression.
  - 'regexp' setting specifies the regular expression to match in the data.
  - 'char' setting specifies the character(s) with which to replace matches of the regular expression (defaults to empty, i.e. strip all matches).
  - 'keep' setting specifies that, instead of being replaced, only the matches should be kept.
- CharacterEntityPreParser: Turn latin-1 entities into XML character entities (e.g. &ndash; -> &#8211;).
- MarcToXmlPreParser: Convert MARC records into MARCXML
- MarcToSgmlPreParser: Convert MARC records into MARCSGML (Cheshire2 format)
- GzipPreParser: Uncompress a gzipped file
- BzipPreParser: Uncompress a bzipped file
- B64EncodePreParser, B64DecodePreParser: Encode or decode a file using Base64.
- OpenOfficePreParser: Using an OpenOffice server, turn any format that OOo recognises (.doc, .xls, .odt etc.) into OpenDocument XML.
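The three behaviours listed for RegexpSmashPreParser (strip, replace, keep) can be illustrated with a short standalone Python sketch. It is not the Cheshire3 implementation: the function regexp_smash is hypothetical, its parameters simply mirror the 'regexp', 'char' and 'keep' settings, and joining the kept matches with no separator is an assumption.

```python
import re


def regexp_smash(data, regexp, char="", keep=False):
    """Strip, replace or keep the parts of `data` that match `regexp`.

    Hypothetical helper mirroring the 'regexp', 'char' and 'keep'
    settings; for illustration only.
    """
    pattern = re.compile(regexp)
    if keep:
        # Keep only the matched text and discard everything else.
        # Assumption: matches are concatenated without a separator.
        return "".join(m.group(0) for m in pattern.finditer(data))
    # Replace every match with `char`. The empty default strips matches.
    # A function is used so that `char` is inserted literally.
    return pattern.sub(lambda m: char, data)


if __name__ == "__main__":
    sample = "a1b22c"
    print(regexp_smash(sample, r"\d+"))             # strip
    print(regexp_smash(sample, r"\d+", char=" "))   # replace
    print(regexp_smash(sample, r"\d+", keep=True))  # keep
```

With the default empty 'char', the replace mode and the strip mode are the same operation, which is why the settings list describes stripping as the default.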