Suffix List

Use the Suffix List of the Translate Table to define a list of passes containing suffix patterns used in the stemming process.

The stemming process allows a user to match words based on their roots. For instance, the user may wish to consider “banker” and “banking” a match because they share the same root, “bank”. To accomplish this, the trailing characters of a word are compared against patterns representing the suffixes to be removed. The comparisons are made character by character and are not case-sensitive.

In addition to alphanumeric characters, four wildcards and one special character are supported by the matching algorithm.

Wildcard Characters

| Character | Description |
| --- | --- |
| ? | Match any single character |
| % | Match any single consonant (b-d, f-h, j-n, p-t, v-z) |
| @ | Match any single vowel (a, e, i, o, u) |
| # | Match any single digit (0-9) |
| ! | Next character must appear twice |
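The wildcard rules above can be sketched in Python. This is a minimal illustration, not the product's implementation: the names `char_matches`, `expand` and `match_suffix` are hypothetical, and the sketch assumes that `!` followed by a pattern character means two consecutive identical characters that each satisfy that pattern character.

```python
VOWELS = set("aeiou")
CONSONANTS = set("bcdfghjklmnpqrstvwxyz")
DIGITS = set("0123456789")


def char_matches(pat, ch):
    """Test one lower-cased word character against one pattern character."""
    if pat == "?":
        return True
    if pat == "%":
        return ch in CONSONANTS
    if pat == "@":
        return ch in VOWELS
    if pat == "#":
        return ch in DIGITS
    return pat.lower() == ch  # literal characters compare without regard to case


def expand(pattern):
    """Expand '!' so that '!%' becomes two entries, the second flagged as a repeat.

    Assumption: '!' doubles only the single pattern character that follows it.
    """
    entries, doubled = [], False
    for pat in pattern:
        if pat == "!":
            doubled = True
            continue
        entries.append((pat, False))
        if doubled:
            entries.append((pat, True))
            doubled = False
    return entries


def match_suffix(word, pattern):
    """Return the matched tail of word, or None if the pattern does not fit."""
    entries = expand(pattern)
    if len(word) < len(entries):
        return None
    tail = word[len(word) - len(entries):]
    lowered = tail.lower()
    for i, (pat, repeat) in enumerate(entries):
        if not char_matches(pat, lowered[i]):
            return None
        if repeat and lowered[i] != lowered[i - 1]:
            return None
    return tail


print(match_suffix("clapping", "!%ing"))
print(match_suffix("punting", "!%ing"))
```

Under these assumptions the first call returns the five matched characters `pping`, and the second returns `None` because `n` and `t` are not a doubled consonant.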

Stemming is performed in a series of one or more passes. If a match is found, the suffix is removed from the word. In addition to identifying a pattern for suffix removal, the designer may also identify a replacement string to be attached to the root after the suffix is removed. Occasionally, the designer may not want to strip every character matched by the pattern from the end of the word. To allow for this, the replacement string may contain the following special character one or more times.

Retention Character

| Character | Description |
| --- | --- |
| . | Reattaches the character that was stripped from the same position in the matched text. |
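As a rough sketch of how a replacement string might be applied, the Python function below treats each `.` as "keep the stripped character at this position" and copies every other character literally. The function name `apply_replacement` and the position-by-position reading of `.` are assumptions made for illustration; they are consistent with the examples later in this section but are not confirmed as the actual implementation.

```python
def apply_replacement(root, stripped, replacement):
    """Build the stemmed word from the root, the stripped suffix and the replacement.

    Assumption: the i-th character of the replacement lines up with the i-th
    stripped character, so '.' at index i reattaches stripped[i].
    """
    out = []
    for i, ch in enumerate(replacement):
        if ch == ".":
            if i < len(stripped):
                out.append(stripped[i])  # retain the original character
        else:
            out.append(ch)  # literal replacement character
    return root + "".join(out)


# "fencing" matched by %cing: root "fe", stripped "ncing", replacement ".ce"
print(apply_replacement("fe", "ncing", ".ce"))
```

With the replacement `.ce`, the `n` is retained and `ce` is appended, giving `fence`.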

Once a match has been made during any pass, the remaining patterns defined for the pass are ignored and the process moves on to the next pass.
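The control flow described here (first matching rule wins within a pass, then the next pass begins) can be sketched as follows. To keep the sketch short it uses literal suffixes instead of wildcard patterns, and the function name `run_passes` and the sample rules are hypothetical.

```python
def run_passes(word, passes):
    """Apply each pass in order; within a pass, stop at the first matching rule."""
    for rules in passes:
        for suffix, replacement in rules:
            if word.lower().endswith(suffix):
                word = word[:len(word) - len(suffix)] + replacement
                break  # remaining rules in this pass are ignored
    return word


# Hypothetical rules: pass 1 strips plural-like endings, pass 2 strips "er".
passes = [
    [("ings", ""), ("s", "")],
    [("er", "")],
]
print(run_passes("bankers", passes))
print(run_passes("bankings", passes))
```

Both words reduce to `bank`: for `bankings`, the first rule of pass 1 matches, so the `s` rule is never tried.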

Suffix List Column Definitions

| Column | Description |
| --- | --- |
| Threshold* | Word length that must be exceeded before the current pattern and replacement are applied. Overrides the process threshold if greater. |
| Pattern | ASCII pattern to be matched as a suffix for replacement. |
| Replacement | ASCII string to be used as a suffix to replace the matched pattern. |

*To help prevent a portion of the root from being mistaken for a suffix, a threshold size is defined for each pattern. If the number of characters in the word is less than or equal to the threshold, the word is not tested against that pattern. This is similar to the process threshold, except that it is defined separately for each suffix pattern, allowing more exact tuning of the stemming process. A threshold set lower than the process threshold has the same effect as a threshold equal to the process threshold.
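The threshold rule can be expressed as a small check. This is a sketch only: the name `is_eligible` is hypothetical, and it assumes the process threshold uses the same "less than or equal" comparison as the per-pattern threshold.

```python
def is_eligible(word, rule_threshold, process_threshold):
    """Return True if the word is long enough to be tested against a pattern."""
    # A rule threshold below the process threshold behaves like the process threshold.
    effective = max(rule_threshold, process_threshold)
    # Words whose length is less than or equal to the threshold are not tested.
    return len(word) > effective


print(is_eligible("stoning", 6, 4))
print(is_eligible("boring", 6, 4))
```

Here `stoning` (7 characters) is tested against a rule with threshold 6, while `boring` (6 characters) is not.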

Example of pattern matching replacement (Single Pass)

Pass 1

| Rule | Threshold | Pattern | Replacement | Description |
| --- | --- | --- | --- | --- |
| 1 | 6 | !%ing | . | compress double consonant |
| 2 | 6 | %cing | .ce | replace “ing” with “e” |
| 3 | 6 | %%ing | .. | no substitution for “ing” |
| 4 | 6 | !%@%ing | .... | no substitution for “ing” |
| 5 | 6 | %@%ing | ...e | replace “ing” with “e” |
| 6 | 6 | @@%ing | ... | no substitution for “ing” |

Result

| Rule | Word | Stemmed |
| --- | --- | --- |
| 1 | clapping | clap |
| 2 | fencing | fence |
| 3 | punting | punt |
| 4 | flattening | flatten |
| 5 | stoning | stone |
| 6 | waiting | wait |

Example of pattern matching replacement (Multi-Pass)

Pass 1

| Rule | Threshold | Pattern | Replacement | Description |
| --- | --- | --- | --- | --- |
| 1 | 6 | !%ing | . | compress double consonant |
| 2 | 6 | %cing | .ce | replace “ing” with “e” |
| 3 | 6 | %%ing | .. | no substitution for “ing” |
| 4 | 6 | !%@%ing | .... | no substitution for “ing” |
| 5 | 6 | %@%ing | ...e | replace “ing” with “e” |
| 6 | 6 | @@%ing | ... | no substitution for “ing” |

Pass 2

| Rule | Threshold | Pattern | Replacement | Description |
| --- | --- | --- | --- | --- |
| 1 | 6 | !%en | . | compress double consonant |

Result

| Rule | Word | Stemmed |
| --- | --- | --- |
| 1 | clapping | clap |
| 2 | fencing | fence |
| 3 | punting | punt |
| 4 | flattening | flat |
| 5 | stoning | stone |
| 6 | waiting | wait |
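The two examples can be reproduced with a short Python sketch that combines wildcard matching, the retention character, thresholds and passes. It is an illustration under stated assumptions, not the product's code: all function names are hypothetical, `!` is assumed to double the single pattern character that follows it, `.` is assumed to reattach the stripped character at the same position, and no separate process threshold is modelled.

```python
VOWELS = set("aeiou")
CONSONANTS = set("bcdfghjklmnpqrstvwxyz")
DIGITS = set("0123456789")


def char_matches(pat, ch):
    """Test one lower-cased word character against one pattern character."""
    if pat == "?":
        return True
    if pat == "%":
        return ch in CONSONANTS
    if pat == "@":
        return ch in VOWELS
    if pat == "#":
        return ch in DIGITS
    return pat.lower() == ch


def expand(pattern):
    """Expand '!' so the following pattern character appears twice (second = repeat)."""
    entries, doubled = [], False
    for pat in pattern:
        if pat == "!":
            doubled = True
            continue
        entries.append((pat, False))
        if doubled:
            entries.append((pat, True))
            doubled = False
    return entries


def match_suffix(word, pattern):
    """Return the matched tail of word, or None if the pattern does not fit."""
    entries = expand(pattern)
    if len(word) < len(entries):
        return None
    tail = word[len(word) - len(entries):]
    lowered = tail.lower()
    for i, (pat, repeat) in enumerate(entries):
        if not char_matches(pat, lowered[i]):
            return None
        if repeat and lowered[i] != lowered[i - 1]:
            return None
    return tail


def apply_rule(word, threshold, pattern, replacement):
    """Return the rewritten word, or None if the rule does not apply."""
    if len(word) <= threshold:  # words at or below the threshold are not tested
        return None
    tail = match_suffix(word, pattern)
    if tail is None:
        return None
    root = word[:len(word) - len(tail)]
    kept = []
    for i, ch in enumerate(replacement):
        if ch == ".":
            if i < len(tail):
                kept.append(tail[i])  # reattach the stripped character
        else:
            kept.append(ch)
    return root + "".join(kept)


def stem(word, passes):
    """Run every pass in order; within a pass the first matching rule wins."""
    for rules in passes:
        for threshold, pattern, replacement in rules:
            result = apply_rule(word, threshold, pattern, replacement)
            if result is not None:
                word = result
                break
    return word


PASS_1 = [(6, "!%ing", "."), (6, "%cing", ".ce"), (6, "%%ing", ".."),
          (6, "!%@%ing", "...."), (6, "%@%ing", "...e"), (6, "@@%ing", "...")]
PASS_2 = [(6, "!%en", ".")]
WORDS = ["clapping", "fencing", "punting", "flattening", "stoning", "waiting"]

for w in WORDS:
    print(w, stem(w, [PASS_1]), stem(w, [PASS_1, PASS_2]))
```

Under these assumptions the sketch yields the stems listed in both Result tables. Only "flattening" differs between the two runs: pass 1 produces "flatten", which is still longer than the threshold of 6, so pass 2 reduces it to "flat".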