Configuring Extractors, Tokenizers, TokenMergers and Normalizers
Introduction
Extractors locate and extract data from either a string, a DOM node tree, or a list of SAX events.
An Extractor must be the first object in an index's process workflow.
Tokenizers may then be used to split the data. If a Tokenizer is used, then it must be followed by a TokenMerger.
Normalizers may then be used to process those terms into a standard form for storage in an index.
Unless you're using a new extractor or normalizer class, these objects should all be built by the default server configuration, but for completeness we'll go through their configuration below.
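To put this workflow in context, here is a minimal sketch of how these objects might be chained together in an index configuration. The id, xpath and referenced object ids are illustrative, not part of the default configuration:
01 <subConfig type="index" id="example-word-idx">
02 <objectType>index.SimpleIndex</objectType>
03 <source>
04 <xpath>//title</xpath>
05 <process>
06 <object type="extractor" ref="SimpleExtractor"/>
07 <object type="tokenizer" ref="RegexpFindTokenizer"/>
08 <object type="tokenMerger" ref="ProximityTokenMerger"/>
09 <object type="normalizer" ref="CaseNormalizer"/>
10 </process>
11 </source>
12 </subConfig>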
Example: Extractors
Example extractor configurations:
01 <subConfig type="extractor" id="SimpleExtractor">
02 <objectType>extractor.SimpleExtractor</objectType>
03 </subConfig>
04
05 <subConfig type="extractor" id="ProxExtractor">
06 <objectType>extractor.SimpleExtractor</objectType>
07 <options>
08 <setting type="prox">1</setting>
09 <setting type="reversable">1</setting>
10 </options>
11 </subConfig>
Explanation: Extractors
There's obviously not much to say about the first subConfig for SimpleExtractor.
It just does one thing and doesn't have any paths or settings to set.
The second subConfig, for ProxExtractor, is a little more complex.
First, at line 8, it has the setting "prox", which tells the extractor to keep track of which element in the record the data was extracted from.
This is important if you want to be able to conduct proximity, adjacency or phrase searches on the data extracted later.
The second setting, "reversable" on line 9, tells the extractor to maintain this location information in such a way that we can use it later to identify exactly where the data originally came from.
If this setting is not set, or is set to 0, the location will be maintained in such a way that it is only possible to tell if two pieces of data were extracted from the same element.
Some of the currently available extractors:
- SimpleExtractor: Extracts the data exactly as it appears, but without any XML tags. Whether leading/trailing whitespace is kept or stripped is configurable.
- TeiExtractor: Extracts text, respecting processing instructions specified by TEI tags, without any XML tags.
- TaggedTermExtractor: Each term has already been tagged in XML. Extracts the terms and information from the XML tags. (Often used on the results of NLP tasks)
- TemplateTermExtractor: Each term has already been tagged in XML. Extracts the terms and information from the XML tags. What information to extract, and where to extract it from, is configurable. (Often used on the results of NLP tasks)
Example: Tokenizers
[ Coming soon ]
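In the meantime, here is a minimal sketch following the same subConfig pattern as the extractor examples above. The id is illustrative, and no settings are required in the simplest case:
01 <subConfig type="tokenizer" id="RegexpFindTokenizer">
02 <objectType>tokenizer.RegexpFindTokenizer</objectType>
03 </subConfig>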
Explanation: Tokenizers
Some of the currently available tokenizers:
- SimpleTokenizer: Splits data into tokens at all occurrences of the string specified in the 'char' setting; defaults to splitting at all whitespace.
- RegexpSubTokenizer: Carries out a regular expression substitution, replacing the pattern specified in the 'regexp' setting (a default is supplied that covers many common word delimiters, but is too complex to repeat here) with the string specified in the 'char' setting (defaults to a single space), before splitting at whitespace.
- RegexpFindTokenizer: Uses a complex regular expression to identify common language tokens (e.g. regular words, hyphenated words, acronyms, email addresses, URLs, gene alleles, o'clock, O'Reilly, don't, I'll, monetary amounts, percentages). The regular expression used to identify words is configurable in the 'regexp' setting, but such configuration is not advisable!
- RegexpFindOffsetTokenizer: As above, but also maintains the character position in the string where each token began.
- RegexpFindPunctuationOffsetTokenizer: As above, but with punctuation characters subtracted from the offset.
- SuppliedOffsetTokenizer: Splits data at whitespace, and maintains precalculated character offsets. Often used on the results of NLP tools. Offsets must be supplied in the form: This/0 is/5 my/8 sentence./11
- SentenceTokenizer: Uses a regular expression (not configurable) to identify sentences, ignoring punctuation within commonly occurring abbreviations.
- LineTokenizer: Splits the data at newline characters.
- DateTokenizer: Identifies any temporal date/time strings within the data and returns only these. The 'dayfirst' setting can be used to specify whether to assume US or UK conventions in ambiguous cases: 0 = US style, month first (default); 1 = UK style, day first (see the sketch after this list).
- PythonTokenizer: Used to tokenize Python source code into token/type, maintains character offsets.
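As an illustration of the 'dayfirst' setting, a minimal sketch of a DateTokenizer configured for UK-style day-first parsing (the id is illustrative):
01 <subConfig type="tokenizer" id="UKDateTokenizer">
02 <objectType>tokenizer.DateTokenizer</objectType>
03 <options>
04 <setting type="dayfirst">1</setting>
05 </options>
06 </subConfig>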
Example: TokenMergers
[ Coming soon ]
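In the meantime, a minimal sketch following the same subConfig pattern as the other examples (the id is illustrative):
01 <subConfig type="tokenMerger" id="ProximityTokenMerger">
02 <objectType>tokenMerger.ProximityTokenMerger</objectType>
03 </subConfig>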
Explanation: TokenMergers
Some of the currently available tokenMergers:
- SimpleTokenMerger: The simplest case: merges identical tokens into a single index entry, taking care of number of occurrences and location (at XML element level only). All other TokenMergers inherit these abilities.
- ProximityTokenMerger: As SimpleTokenMerger, but additionally takes care of the position of tokens within the tokenized data.
- OffsetProximityTokenMerger: As ProximityTokenMerger, but additionally takes care of character offset of tokens within tokenized data.
- SequenceRangeTokenMerger: Takes pairs of tokens and joins them into a range to treat as an index entry. This can be useful for numbers or dates (e.g. 1,2,2,3,5,6 --> 1-2, 2-3, 5-6)
- MinMaxRangeTokenMerger: Takes a number of tokens and finds the minimum and maximum to use as the extents of a range to treat as an index entry (e.g. 1,2,2,3,5,6 --> 1-6).
- NGramTokenMerger: Joins adjacent tokens into a single index entry. The 'nValue' setting specifies the number of adjacent terms to join; the default is 2 (e.g. 'this', 'is', 'my', 'sentence' --> 'this is', 'is my', 'my sentence'). This can be particularly useful in linguistic analysis (see the sketch after this list).
- ReconstructTokenMerger: [ Coming soon ]
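As an illustration of the 'nValue' setting mentioned above, a minimal sketch configuring an NGramTokenMerger for trigrams (the id is illustrative):
01 <subConfig type="tokenMerger" id="TrigramTokenMerger">
02 <objectType>tokenMerger.NGramTokenMerger</objectType>
03 <options>
04 <setting type="nValue">3</setting>
05 </options>
06 </subConfig>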
Example: Normalizers
Example normalizer configurations:
01 <subConfig type="normalizer" id="CaseNormalizer">
02 <objectType>normalizer.CaseNormalizer</objectType>
03 </subConfig>
04
05 <subConfig type="normalizer" id="StoplistNormalizer">
06 <objectType>normalizer.StoplistNormalizer</objectType>
07 <paths>
08 <path type="stoplist">stopwords.txt</path>
09 </paths>
10 </subConfig>
Explanation: Normalizers
Normalizers usually just do one pre-defined job, so there aren't many options or paths to set.
The second example (lines 5-10) is a rare exception.
This is a StoplistNormalizer, and requires a path of type 'stoplist' (line 8).
The stoplist file should have one word per line.
The normalizer will remove all occurrences of these words from the data.
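For example, the first few lines of such a stoplist file might read as follows (the words themselves are illustrative):
a
an
and
the
of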
Some of the currently available normalizers:
- CaseNormalizer: Convert the term to lower case (e.g. Fish and Chips -> fish and chips)
- PossessiveNormalizer: Remove trailing possessive from the term (e.g. squirrel's -> squirrel, princesses' -> princesses)
- ArticleNormalizer: Remove leading definite or indefinite article (the fish -> fish)
- PrintableNormalizer: Remove any non-printable characters
- StripperNormalizer: Remove printable punctuation characters: " % # @ ~ ! * { }
- StoplistNormalizer: Remove words from a given stoplist, given in a path of type 'stoplist' (<path type="stoplist">stoplist.txt</path>) The stoplist file should have one word per line.
- DateStringNormalizer: Convert a Date object extracted by DateExtractor into an ISO8601 formatted string. DEPRECATED in favour of DateTokenizer.
- DiacriticNormalizer: Remove all diacritics from characters (e.g. é -> e)
- IntNormalizer: Convert a string into an integer (e.g. '2' -> 2)
- StringIntNormalizer: Convert an integer into a 0 padded string (e.g. 2 -> '000000000002')
- EnglishStemNormalizer: Convert an English word into a stemmed form, according to the Porter2 algorithm (e.g. Fairy -> fairi). You must have run PossessiveNormalizer before this normalizer for it to work properly (see the sketch after this list).
- KeywordNormalizer: Convert an exact extracted string into keywords.
- ProximityNormalizer: Convert an exact extracted string into keywords maintaining proximity information.
- ExactExpansionNormalizer: Sample implementation of an acronym and contraction expanding normalizer (e.g. 'XML' -> 'Extensible Markup Language')
- WordExpansionNormalizer: Sample implementation of an acronym expander when dealing with words rather than exact strings (e.g. 'XML' -> 'Extensible', 'Markup', 'Language')
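Tying these pieces together, here is a hedged sketch of a process chain for a stemmed word index; as noted above, PossessiveNormalizer must run before EnglishStemNormalizer. The id, xpath and referenced object ids are illustrative:
01 <subConfig type="index" id="stemmed-word-idx">
02 <objectType>index.SimpleIndex</objectType>
03 <source>
04 <xpath>//p</xpath>
05 <process>
06 <object type="extractor" ref="SimpleExtractor"/>
07 <object type="tokenizer" ref="RegexpFindTokenizer"/>
08 <object type="tokenMerger" ref="ProximityTokenMerger"/>
09 <object type="normalizer" ref="CaseNormalizer"/>
10 <object type="normalizer" ref="PossessiveNormalizer"/>
11 <object type="normalizer" ref="EnglishStemNormalizer"/>
12 </process>
13 </source>
14 </subConfig>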