Posts Tagged ‘ESA’
English Stemming Algorithm (2003)
Sunday, June 14th, 2009English, Joshua S. (2003)
Introduction
This document describes the English Stemming Algorithm, which is intended for reducing the morphological variation of a given stream of tokens through conflation; commonly called stemming. Stemming is used in information retrieval (IR) applications to enhance performance.
Definition
The English Stemming Algorithm (ESA) uses a list of rules to reduce any given token, such as a word, to a common base. The rules are applied iteratively by matching a suffix string against token endings. The rule is flagged so that it may or may not apply to the token depending on if the token has already been modified by a previous rule. Each rule specifies removing, replacing, or adding characters to the end of the token.
The ordering of the rules is significant because a rule may be designed to act upon changes made by a previous rule. Each rule may also be of arbitrary length and consists of one or more instances of the four ESA rule components. The list of rules is processed iteratively until each has had a chance to apply itself to the token. When considering the flag of the second or greater instance in a rule, the flag only applies to changes made by the rule.
The ESA imposes the following boundary conditions. The input must be greater than two characters in length and must contain at least one vowel and one consonant. The output must be greater than two characters in length.
Rule Format
Rules consist of the following four parts:
- A suffix string (token ending) of 1 or more characters.
- A flag (y, Y, n, or N) indicating if the rule can be applied to the token. If the flag is ‘y’ or ‘Y’ and the token has already been modified by another rule or this same rule (for rules with multiple instances) the rule cannot be applied to the token.
- The number of characters to remove.
- The string to append to the token consisting of 0 or more characters.
In the reference implementation, rules were formatted with a comma character as a delimiter. For example:
“ing,n,3,”
The above rule looks for tokens ending in ‘ing’, this rule is applied regardless of token modification, three characters are removed, and no characters are appended. The rule format may be repeated to create more complex rules. For example:
“s,n,1,,ies,n,2,,si,n,1,s”
Note that the rule format is the same but is simply extended using the defined ESA format. It was noted that future implementations might use a different character to delineate the rule sections for ease of modification and readability.
Reference Implementation
The ESA was implemented and tested using a simple rule list containing 67 items that were derived from the rule set used by the Porter Stemming Algorithm (Porter, M.F., 1980).
It was noted during implementation that organizing the rules in a tree structure would lend itself to the highest performance possible. This was due to the fact that a simple tree search could be performed to determine the rule(s) to be used for any given token.
However, for the sake of simplicity the rules were simply organized into an array where they could be tested iteratively against a stream of tokens. It was found that even use of this low-performance design still resulted in quick, powerful stemming.
The ESA test performance time was 0.000058 seconds per word on a Pentium Celeron 500 MHz system. The reference implementation has currently only been compared to the Porter Stemming Algorithm which has a performance time of 0.000072 seconds per word on the same system.
Comparison
The ESA is comparable to the Paice/Husk or Lancaster Stemming Algorithm (Paice, C. Husk, G. 1980) in that it uses a series of rules. However ESA is able to use a shorter (and therefore faster) rule set than Paice/Husk due to its ability to have extended and interdependent rules.
The ESA is also comparable to the Porter Stemming Algorithm in that it uses a similar rule set to accomplish stemming. The ESA is simpler and faster because it uses a single step to process tokens unlike the multi-step process that Porter used.
ESA is superior to both the Lovins (Lovins, J.B. 1968) and Dawson (Dawson, J.L. 1974) Stemming Algorithms in the same fashion that Paice/Husk is superior to them. It is not hampered by post processing (as in Lovins) and ‘partial-matching’ (as found in Dawson). It is also not restricted to a set list of suffixes as found in both of these algorithms.
The ESA is also much faster than such inflectional linguistic morphology algorithms as Krovetz (Krovetz, B. 1993). It is also much more reliable due to the fact that the English language does not lend itself inflectional linguistic morphology because it is a weak morphological language.
Download reference-implementation source here: download