|
Wil Baden 2001-08-23
Purpose: Implementation of the Porter stemming algorithm documented in: Porter, M.F., "An Algorithm For Suffix Stripping," Program 14 (3), July 1980, pp. 130-137.
Provenance: Written by B. Frakes and C. Fox, 1986-1991.
Forthed by Wil Baden from Information Retrieval, Frakes and Baeza-Yates, 1992, ISBN 0-13-463837-9.
Program Text 1\ SFX Aid to defining suffixes in rule.
: SFX ( "string" -- )
PARSE-WORD dup >R STRING, 7 R> - ALLOT ;
\ LAMBDA Empty suffix.
: LAMBDA ( -- )
0 , 0 , ;
0 CONSTANT NULL
\ Fields in rule.
0
1 CELLS BOUNDS FIELD ->Rule-ID
2 CELLS BOUNDS FIELD ->Old-End
2 CELLS BOUNDS FIELD ->New-End
1 CELLS BOUNDS FIELD ->Old-Offset
1 CELLS BOUNDS FIELD ->New-Offset
1 CELLS BOUNDS FIELD ->Min-Root-Size
1 CELLS BOUNDS FIELD ->Condition
FIELD ->Next-Rule
Returns: int -- a weird count of word size in adjusted syllables
Purpose: Count syllables in a special way: count the number vowel-consonant pairs in a word, disregarding initial consonants and final vowels. The letter "y" counts as a consonant at the beginning of a word and when it has a vowel in front of it; otherwise (when it follows a consonant) it is treated as a vowel. For example, the WordSize of "cat" is 1, of "any" is 1, of "amount" is 2, of "anything" is 3.
Plan: Run a DFA to compute the word size
Notes: The easiest and fastest way to compute this funny measure is with a finite state machine. The initial state 0 checks the first letter. If it is a vowel, then the machine changes to state 1, which is the "last letter was a vowel" state. If the first letter is a consonant or y, then it changes to state 2, the "last letter was a consonant state". In state 1, a y is treated as a consonant (since it follows a vowel), but in state 2, y is treated as a vowel (since it follows a consonant. The result counter is incremented on the transition from state 1 to state 2, since this transition only occurs after a vowel-consonant pair, which is what we are counting.
Program Text 2: IS-VOWEL ( char -- flag )
dup [char] a = ORIF
dup [char] e = ORIF
dup [char] i = ORIF
dup [char] o = ORIF
dup [char] u = THEN THEN THEN THEN
NIP ;
: Word-Size ( str len -- measure )
0 0 2SWAP BOUNDS ?DO ( result state)
CASE
0 OF
I C@ IS-VOWEL IF 1 ELSE 2 THEN
ENDOF
1 OF
I C@ IS-VOWEL IF 1 ELSE 1+ 2 THEN
ENDOF
2 OF
I C@ IS-VOWEL ORIF I C@ [char] y = THEN
IF 1 ELSE 2 THEN
ENDOF
ENDCASE
LOOP DROP ( result)
;
[VOID] [IF] ................................................................
: Word-Size ( str len -- measure )
0 ROT ROT ( result str len)
over C@ IS-VOWEL NOT IF GOTO 2 THEN
LABEL 1 \ Vowel
BEGIN
1 /STRING dup 0= IF GOTO exit THEN
over C@ IS-VOWEL NOT
UNTIL
2>R 1+ 2R>
LABEL 2 \ Consonant
BEGIN
1 /STRING dup 0= IF GOTO exit THEN
I C@ IS-VOWEL ORIF I C@ [char] y = THEN
UNTIL
GOTO 1
LABEL exit
2DROP ;
..................................................................... [THEN]
Returns: int -- TRUE if the word parameter contains a vowel, FALSE otherwise.
Purpose: Some of the rewrite rules apply only to a root containing a vowel, where a vowel is one of "aeiou" or y with a consonant in front of it.
Plan: Obviously, under the definition of a vowel, a word contains a vowel iff either its first letter is one of "aeiou", or any of its other letters are "aeiouy". The plan is to test this condition.
Returns: int -- TRUE if the current word ends with a consonant-vowel-consonant combination, and the second consonant is not w, x, or y, FALSE (0) otherwise.
Purpose: Some of the rewrite rules apply only to a root with this characteristic.
Plan: Look at the last three characters.
Notes: None
Returns: int -- TRUE if the current word meets special conditions for adding an e.
Purpose: Rule 122 applies only to a root with this characteristic.
Plan: Check for size of 1 and a consonant-vowel-consonant ending.
Notes: None
Returns: int -- TRUE if the current word meets special conditions for removing an e.
Purpose: Rule 502 applies only to a root with this characteristic.
Plan: Check for size of 1 and no consonant-vowel-consonant ending.
Program Text 3: Contains-Vowel ( str len -- flag )
dup 0= IF 2DROP FALSE EXIT THEN
over C@ IS-VOWEL IF 2DROP TRUE EXIT THEN
1 /STRING SCAN[ IS-VOWEL ORIF over C@ [char] y = THEN ]SCAN
NIP 0<> ;
: Ends-with-CVC ( str len -- flag )
dup 2 < IF 2DROP FALSE EXIT THEN
+ ( end+1)
1- S" aeiouwxy" third C@ SCAN NIP 0= ANDIF
1- S" aeiouy" third C@ SCAN NIP 0<> ANDIF
1- S" aeiou" third C@ SCAN NIP 0= THEN THEN
NIP ;
: Add-an-E ( str len -- flag )
2dup Word-Size 1 = ANDIF 2dup Ends-with-CVC THEN
NIP NIP ;
: Remove-an-E ( str len -- flag )
2dup Word-Size 1 = ANDIF 2dup Ends-with-CVC NOT THEN
NIP NIP ;
CREATE Step1A-Rules
101 , SFX sses SFX ss 3 , 1 , -1 , NULL ,
102 , SFX ies SFX i 2 , 0 , -1 , NULL ,
103 , SFX ss SFX ss 1 , 1 , -1 , NULL ,
104 , SFX s LAMBDA 0 , -1 , -1 , NULL ,
000 ,
CREATE Step1B-Rules
105 , SFX eed SFX ee 2 , 1 , 0 , NULL ,
106 , SFX ed LAMBDA 1 , -1 , -1 , NULL ,
107 , SFX ing LAMBDA 2 , -1 , -1 , NULL ,
000 ,
CREATE Step1B1-Rules
108 , SFX at SFX ate 1 , 2 , -1 , NULL ,
109 , SFX BL SFX ble 1 , 2 , -1 , NULL ,
110 , SFX iz SFX ize 1 , 2 , -1 , NULL ,
111 , SFX bb SFX b 1 , 0 , -1 , NULL ,
112 , SFX dd SFX d 1 , 0 , -1 , NULL ,
113 , SFX ff SFX f 1 , 0 , -1 , NULL ,
114 , SFX gg SFX g 1 , 0 , -1 , NULL ,
115 , SFX mm SFX m 1 , 0 , -1 , NULL ,
116 , SFX nn SFX n 1 , 0 , -1 , NULL ,
117 , SFX pp SFX p 1 , 0 , -1 , NULL ,
118 , SFX rr SFX r 1 , 0 , -1 , NULL ,
119 , SFX tt SFX t 1 , 0 , -1 , NULL ,
120 , SFX ww SFX w 1 , 0 , -1 , NULL ,
121 , SFX xx SFX x 1 , 0 , -1 , NULL ,
122 , LAMBDA SFX e -1 , 0 , -1 , ' Add-an-E ,
000 ,
CREATE Step1C-Rules
123 , SFX y SFX i 0 , 0 , -1 , ' Contains-Vowel ,
000 ,
CREATE Step2-Rules
203 , SFX ational SFX ate 6 , 2 , 0 , NULL ,
204 , SFX tional SFX tion 5 , 3 , 0 , NULL ,
205 , SFX enci SFX ence 3 , 3 , 0 , NULL ,
206 , SFX anci SFX ance 3 , 3 , 0 , NULL ,
207 , SFX izer SFX ize 3 , 2 , 0 , NULL ,
208 , SFX abli SFX able 3 , 3 , 0 , NULL ,
209 , SFX alli SFX al 3 , 1 , 0 , NULL ,
210 , SFX entli SFX ent 4 , 2 , 0 , NULL ,
211 , SFX eli SFX e 2 , 0 , 0 , NULL ,
213 , SFX ousli SFX ous 4 , 2 , 0 , NULL ,
214 , SFX ization SFX ize 6 , 2 , 0 , NULL ,
215 , SFX ation SFX ate 4 , 2 , 0 , NULL ,
216 , SFX ator SFX ate 3 , 2 , 0 , NULL ,
217 , SFX alism SFX al 4 , 1 , 0 , NULL ,
218 , SFX iveness SFX ive 6 , 2 , 0 , NULL ,
219 , SFX fulnes SFX ful 5 , 2 , 0 , NULL ,
220 , SFX ousness SFX ous 6 , 2 , 0 , NULL ,
221 , SFX aliti SFX al 4 , 1 , 0 , NULL ,
222 , SFX iviti SFX ive 4 , 2 , 0 , NULL ,
223 , SFX biliti SFX ble 5 , 2 , 0 , NULL ,
000 ,
CREATE Step3-Rules
301 , SFX icate SFX ic 4 , 1 , 0 , NULL ,
302 , SFX ative LAMBDA 4 , -1 , 0 , NULL ,
303 , SFX alize SFX al 4 , 1 , 0 , NULL ,
304 , SFX iciti SFX ic 4 , 1 , 0 , NULL ,
305 , SFX ical SFX ic 3 , 1 , 0 , NULL ,
308 , SFX ful LAMBDA 2 , -1 , 0 , NULL ,
309 , SFX ness LAMBDA 3 , -1 , 0 , NULL ,
000 ,
CREATE Step4-Rules
401 , SFX al LAMBDA 1 , -1 , 1 , NULL ,
402 , SFX ance LAMBDA 3 , -1 , 1 , NULL ,
403 , SFX ence LAMBDA 3 , -1 , 1 , NULL ,
405 , SFX er LAMBDA 1 , -1 , 1 , NULL ,
406 , SFX ic LAMBDA 1 , -1 , 1 , NULL ,
407 , SFX able LAMBDA 3 , -1 , 1 , NULL ,
408 , SFX ible LAMBDA 3 , -1 , 1 , NULL ,
409 , SFX ant LAMBDA 2 , -1 , 1 , NULL ,
410 , SFX ement LAMBDA 4 , -1 , 1 , NULL ,
411 , SFX ment LAMBDA 3 , -1 , 1 , NULL ,
412 , SFX emt LAMBDA 2 , -1 , 1 , NULL ,
423 , SFX sion SFX s 3 , 0 , 1 , NULL ,
424 , SFX tion SFX t 3 , 0 , 1 , NULL ,
415 , SFX ou LAMBDA 1 , -1 , 1 , NULL ,
416 , SFX ism LAMBDA 2 , -1 , 1 , NULL ,
417 , SFX ate LAMBDA 2 , -1 , 1 , NULL ,
418 , SFX iti LAMBDA 2 , -1 , 1 , NULL ,
419 , SFX ous LAMBDA 2 , -1 , 1 , NULL ,
420 , SFX ive LAMBDA 2 , -1 , 1 , NULL ,
421 , SFX ize LAMBDA 2 , -1 , 1 , NULL ,
000 ,
CREATE Step5A-Rules
501 , SFX e LAMBDA 0 , -1 , 1 , NULL ,
502 , SFX e LAMBDA 0 , -1 , -1 , ' Remove-an-E ,
000 ,
CREATE Step5B-Rules
503 , SFX ll SFX l 1 , 0 , 1 , NULL ,
000 ,
Returns: int -- the id for the rule fired, 0 is none is fired
Purpose: Apply a set of rules to replace the suffix of a word
Plan: Loop through the rule set until a match meeting all conditions is found. If a rule fires, return its id, otherwise return 0. Conditions on the length of the root are checked as part of this function's processing because this check is so often made.
Notes: This is the main routine driving the stemmer. It goes through a set of suffix replacement rules looking for a match on the current suffix. When it finds one, if the root of the word is long enough, and it meets whatever other conditions are required, then the suffix is replaced, and the function returns.
Program Text 4: Replace-End ( str len rule -- str len-i rule-id )
BEGIN dup ->Rule-ID @ WHILE
3dup >R over SWAP 1- + R> ->Old-Offset @ - ( . . . str ending)
dup >R ( R: ending)
> NOT IF ( str len rule)
dup ->Old-End COUNT R@ over COMPARE 0= IF
3dup >R Word-Size R> ->Min-Root-Size @ > IF
dup ->Condition @ 0= ORIF
3dup ->Condition @ EXECUTE THEN
IF
dup ->New-End COUNT R@ SWAP MOVE
>R R@ ->Old-Offset @ - R@ ->New-Offset @ +
R> ->Rule-ID @
R> DROP
EXIT THEN
THEN
THEN
THEN
R> DROP
->Next-Rule
REPEAT
DROP 0 ;
Returns: int -- FALSE if the word contains non-alphabetic characters and hence is not stemmed, TRUE otherwise
Purpose: Stem a word
Plan: Part 1: Check to ensure the word is all alphabetic Part 2: Run through the Porter algorithm Part 3: Return an indication of successful stemming
Notes: This function implements the Porter stemming algorithm, with a few additions here and there. See:
Porter, M.F., "An Algorithm For Suffix Stripping," Program 14 (3), July 1980, pp. 130-137.
Porter's algorithm is an ad hoc set of rewrite rules with various conditions on rule firing. The terminology of "step 1a" and so on, is taken directly from Porter's article, which unfortunately gives almost no justification for the various steps. Thus this function more or less faithfully refects the opaque presentation in the article. Changes from the article amount to a few additions to the rewrite rules; these are marked in the RuleList data structures with comments.
Program Text 5 : >PAD ( str len -- PAD len) >R PAD R@ MOVE PAD R> ;
: STEM ( str len -- PAD len-i flag )
>PAD
2dup SCAN[ IS-ALPHA NOT ]SCAN NIP IF FALSE EXIT THEN
2dup STRING>LOWER
Step1A-Rules Replace-End DROP
Step1B-Rules Replace-End
dup 106 = ORIF dup 107 = THEN NIP IF
Step1B1-Rules Replace-End DROP
THEN
Step1C-Rules Replace-End DROP
Step2-Rules Replace-End DROP
Step3-Rules Replace-End DROP
Step4-Rules Replace-End DROP
Step5A-Rules Replace-End DROP
Step5B-Rules Replace-End DROP
TRUE ;