Class DictionaryCompoundWordTokenFilter

All Implemented Interfaces:
Closeable, AutoCloseable, Unwrappable<TokenStream>

public class DictionaryCompoundWordTokenFilter extends CompoundWordTokenFilterBase
A TokenFilter that decomposes compound words found in many Germanic languages.

For example, "Donaudampfschiff" is decomposed into "Donau", "dampf", and "schiff", so that a search for "schiff" alone still finds "Donaudampfschiff". The filter uses a brute-force algorithm to achieve this.
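The brute-force idea can be illustrated without Lucene. The following is a minimal sketch, not the filter's actual implementation: the class BruteForceDecompounder and its decompose helper are hypothetical, the dictionary is a plain Set<String> rather than a CharArraySet, and the lower-casing and inclusive size bounds are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class BruteForceDecompounder {

    // Hypothetical helper (not a Lucene API): tries every substring whose length
    // lies between minSubwordSize and maxSubwordSize (inclusive, by assumption)
    // and keeps those found in the dictionary.
    static List<String> decompose(String token, Set<String> dictionary,
                                  int minSubwordSize, int maxSubwordSize) {
        List<String> parts = new ArrayList<>();
        // Assumption for this sketch: the dictionary holds lower-case entries.
        String lower = token.toLowerCase(Locale.ROOT);
        for (int start = 0; start + minSubwordSize <= lower.length(); start++) {
            for (int len = minSubwordSize;
                 len <= maxSubwordSize && start + len <= lower.length(); len++) {
                String candidate = lower.substring(start, start + len);
                if (dictionary.contains(candidate)) {
                    parts.add(candidate);
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        Set<String> dictionary = Set.of("donau", "dampf", "schiff");
        // prints [donau, dampf, schiff]
        System.out.println(decompose("Donaudampfschiff", dictionary, 2, 15));
    }
}
```

Checking every start offset against every allowed length is what makes the approach brute force: the work grows with the token length times the range of permitted subword sizes, which is acceptable for typical word lengths.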

  • Constructor Details

    • DictionaryCompoundWordTokenFilter

      public DictionaryCompoundWordTokenFilter(TokenStream input, CharArraySet dictionary)
      Parameters:
      input - the TokenStream to process
      dictionary - the word dictionary to match against.
    • DictionaryCompoundWordTokenFilter

      @Deprecated public DictionaryCompoundWordTokenFilter(TokenStream input, CharArraySet dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch, boolean onlyLongestMatchIgnoreSubwords)
      Deprecated.
      Parameters:
      input - the TokenStream to process
      dictionary - the word dictionary to match against.
      minWordSize - only words longer than this get processed
      minSubwordSize - only subwords longer than this get to the output stream
      maxSubwordSize - only subwords shorter than this get to the output stream
      onlyLongestMatch - deprecated; use the onlyLongestMatchIgnoreSubwords parameter instead
      onlyLongestMatchIgnoreSubwords - if true, subwords contained in a longer match are ignored. For example, if a word contains 'schwein', only the longer word 'schwein' is extracted and the subword 'wein' is ignored. Supersedes the onlyLongestMatch parameter.
    • DictionaryCompoundWordTokenFilter

      public DictionaryCompoundWordTokenFilter(TokenStream input, CharArraySet dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatchIgnoreSubwords)
      Parameters:
      input - the TokenStream to process
      dictionary - the word dictionary to match against.
      minWordSize - only words longer than this get processed
      minSubwordSize - only subwords longer than this get to the output stream
      maxSubwordSize - only subwords shorter than this get to the output stream
      onlyLongestMatchIgnoreSubwords - if true, subwords contained in a longer match are ignored. For example, if a word contains 'schwein', only the longer word 'schwein' is extracted and the subword 'wein' is ignored. Supersedes the onlyLongestMatch parameter.
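The effect of onlyLongestMatchIgnoreSubwords described above can be sketched in plain Java. This is an illustration under stated assumptions, not the filter's source: the class LongestMatchSketch and its subwords helper are hypothetical, the dictionary is a plain Set<String>, the size bounds are treated as inclusive, and minWordSize is left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class LongestMatchSketch {

    // Hypothetical helper (not a Lucene API). With the flag off, every dictionary
    // hit is emitted. With the flag on, only the longest hit at each start offset
    // is emitted, and scanning resumes after that hit, so subwords inside it
    // (such as 'wein' inside 'schwein') are never considered.
    static List<String> subwords(String token, Set<String> dictionary,
                                 int minSubwordSize, int maxSubwordSize,
                                 boolean onlyLongestMatchIgnoreSubwords) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < token.length(); start++) {
            String longest = null;
            for (int len = minSubwordSize;
                 len <= maxSubwordSize && start + len <= token.length(); len++) {
                String candidate = token.substring(start, start + len);
                if (!dictionary.contains(candidate)) {
                    continue;
                }
                if (onlyLongestMatchIgnoreSubwords) {
                    longest = candidate; // lengths increase, so the last hit is the longest
                } else {
                    out.add(candidate);
                }
            }
            if (longest != null) {
                out.add(longest);
                start += longest.length() - 1; // skip past the characters already covered
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> dictionary = Set.of("schwein", "wein", "fleisch");
        // prints [schwein, wein, fleisch]
        System.out.println(subwords("schweinefleisch", dictionary, 2, 15, false));
        // prints [schwein, fleisch]
        System.out.println(subwords("schweinefleisch", dictionary, 2, 15, true));
    }
}
```

With the flag enabled, a query for 'wein' would no longer match a document containing only 'schweinefleisch', which is usually the desired behavior for unrelated embedded words.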
  • Method Details