This is G o o g l e's cache of http://home.earthlink.net/~neilbawd/stem.html.
G o o g l e's cache is the snapshot that we took of the page as we crawled the web.
The page may have changed since that time. Click here for the current page without highlighting.
To link to or bookmark this page, use the following url: http://www.google.com/search?q=cache:syFjAwVShvkC:home.earthlink.net/~neilbawd/stem.html+&hl=en&ie=UTF-8


Google is not affiliated with the authors of this page nor responsible for its content.

Porter Stemmer

Porter Stemmer

Get TEXT

Wil Baden 2001-08-23

Purpose: Implementation of the Porter stemming algorithm documented in: Porter, M.F., "An Algorithm For Suffix Stripping," Program 14 (3), July 1980, pp. 130-137.

Provenance: Written by B. Frakes and C. Fox, 1986-1991.

Forthed by Wil Baden from Information Retrieval, Frakes and Baeza-Yates, 1992, ISBN 0-13-463837-9.

Program Text 1
 
\ SFX  Aid to defining suffixes in rule.

: SFX       ( "string" -- )
    PARSE-WORD  dup >R  STRING,  7 R> - ALLOT ;

\  LAMBDA  Empty suffix.

: LAMBDA    ( -- )
    0 , 0 , ;

0 CONSTANT NULL

\  Fields in rule.

0
1 CELLS BOUNDS FIELD ->Rule-ID
2 CELLS BOUNDS FIELD ->Old-End
2 CELLS BOUNDS FIELD ->New-End
1 CELLS BOUNDS FIELD ->Old-Offset
1 CELLS BOUNDS FIELD ->New-Offset
1 CELLS BOUNDS FIELD ->Min-Root-Size
1 CELLS BOUNDS FIELD ->Condition
               FIELD ->Next-Rule


Word-Size ( str len -- measure )

Returns: int -- a weird count of word size in adjusted syllables

Purpose: Count syllables in a special way: count the number vowel-consonant pairs in a word, disregarding initial consonants and final vowels. The letter "y" counts as a consonant at the beginning of a word and when it has a vowel in front of it; otherwise (when it follows a consonant) it is treated as a vowel. For example, the WordSize of "cat" is 1, of "any" is 1, of "amount" is 2, of "anything" is 3.

Plan: Run a DFA to compute the word size

Notes: The easiest and fastest way to compute this funny measure is with a finite state machine. The initial state 0 checks the first letter. If it is a vowel, then the machine changes to state 1, which is the "last letter was a vowel" state. If the first letter is a consonant or y, then it changes to state 2, the "last letter was a consonant state". In state 1, a y is treated as a consonant (since it follows a vowel), but in state 2, y is treated as a vowel (since it follows a consonant. The result counter is incremented on the transition from state 1 to state 2, since this transition only occurs after a vowel-consonant pair, which is what we are counting.

Program Text 2
 
: IS-VOWEL         ( char -- flag )
    dup [char] a = ORIF
    dup [char] e = ORIF
    dup [char] i = ORIF
    dup [char] o = ORIF
    dup [char] u = THEN THEN THEN THEN
    NIP ;

: Word-Size          ( str len -- measure )
    0 0 2SWAP BOUNDS ?DO             ( result state)
        CASE
        0 OF
            I C@ IS-VOWEL IF  1  ELSE  2  THEN
            ENDOF
        1 OF
            I C@ IS-VOWEL IF  1  ELSE  1+  2  THEN
            ENDOF
        2 OF
            I C@ IS-VOWEL  ORIF  I C@ [char] y =  THEN
            IF  1  ELSE  2  THEN
            ENDOF
        ENDCASE
    LOOP DROP                ( result)
    ;

[VOID] [IF] ................................................................
: Word-Size          ( str len -- measure )
    0 ROT ROT                   ( result str len)
    over C@ IS-VOWEL NOT IF  GOTO 2  THEN
    LABEL 1  \  Vowel
        BEGIN
            1 /STRING  dup 0= IF  GOTO exit  THEN
            over C@ IS-VOWEL NOT
        UNTIL
    2>R  1+  2R>
    LABEL 2  \  Consonant
        BEGIN
            1 /STRING  dup 0= IF  GOTO exit  THEN
            I C@ IS-VOWEL  ORIF  I C@ [char] y =  THEN
        UNTIL
    GOTO 1
    LABEL exit
    2DROP ;
..................................................................... [THEN]


Contains-Vowel ( str len -- flag )

Returns: int -- TRUE if the word parameter contains a vowel, FALSE otherwise.

Purpose: Some of the rewrite rules apply only to a root containing a vowel, where a vowel is one of "aeiou" or y with a consonant in front of it.

Plan: Obviously, under the definition of a vowel, a word contains a vowel iff either its first letter is one of "aeiou", or any of its other letters are "aeiouy". The plan is to test this condition.

Ends-with-CVC ( str len -- flag )

Returns: int -- TRUE if the current word ends with a consonant-vowel-consonant combination, and the second consonant is not w, x, or y, FALSE (0) otherwise.

Purpose: Some of the rewrite rules apply only to a root with this characteristic.

Plan: Look at the last three characters.

Notes: None

Add-an-E ( str len -- flag )

Returns: int -- TRUE if the current word meets special conditions for adding an e.

Purpose: Rule 122 applies only to a root with this characteristic.

Plan: Check for size of 1 and a consonant-vowel-consonant ending.

Notes: None

Remove-an-E ( str len -- flag )

Returns: int -- TRUE if the current word meets special conditions for removing an e.

Purpose: Rule 502 applies only to a root with this characteristic.

Plan: Check for size of 1 and no consonant-vowel-consonant ending.

Program Text 3
 
: Contains-Vowel     ( str len -- flag )
    dup 0= IF  2DROP  FALSE  EXIT THEN
    over C@ IS-VOWEL IF  2DROP  TRUE  EXIT THEN
    1 /STRING  SCAN[ IS-VOWEL  ORIF  over C@ [char] y =  THEN ]SCAN
    NIP 0<> ;

: Ends-with-CVC      ( str len -- flag )
    dup 2 < IF  2DROP  FALSE  EXIT THEN
    +                           ( end+1)
    1- S" aeiouwxy" third C@ SCAN NIP 0= ANDIF
    1- S" aeiouy" third C@ SCAN NIP 0<> ANDIF
    1- S" aeiou" third C@ SCAN NIP 0= THEN THEN
    NIP ;

: Add-an-E           ( str len -- flag )
    2dup Word-Size 1 =  ANDIF  2dup Ends-with-CVC  THEN
    NIP NIP ;

: Remove-an-E        ( str len -- flag )
    2dup Word-Size 1 =  ANDIF  2dup Ends-with-CVC NOT  THEN
    NIP NIP ;

CREATE Step1A-Rules
    101 ,  SFX sses     SFX ss        3 ,  1 , -1 ,  NULL ,
    102 ,  SFX ies      SFX i         2 ,  0 , -1 ,  NULL ,
    103 ,  SFX ss       SFX ss        1 ,  1 , -1 ,  NULL ,
    104 ,  SFX s        LAMBDA        0 , -1 , -1 ,  NULL ,
    000 ,

CREATE Step1B-Rules
    105 ,  SFX eed      SFX ee        2 ,  1 ,  0 ,  NULL ,
    106 ,  SFX ed       LAMBDA        1 , -1 , -1 ,  NULL ,
    107 ,  SFX ing      LAMBDA        2 , -1 , -1 ,  NULL ,
    000 ,

CREATE Step1B1-Rules
    108 ,  SFX at       SFX ate       1 ,  2 , -1 ,  NULL ,
    109 ,  SFX BL       SFX ble       1 ,  2 , -1 ,  NULL ,
    110 ,  SFX iz       SFX ize       1 ,  2 , -1 ,  NULL ,
    111 ,  SFX bb       SFX b         1 ,  0 , -1 ,  NULL ,
    112 ,  SFX dd       SFX d         1 ,  0 , -1 ,  NULL ,
    113 ,  SFX ff       SFX f         1 ,  0 , -1 ,  NULL ,
    114 ,  SFX gg       SFX g         1 ,  0 , -1 ,  NULL ,
    115 ,  SFX mm       SFX m         1 ,  0 , -1 ,  NULL ,
    116 ,  SFX nn       SFX n         1 ,  0 , -1 ,  NULL ,
    117 ,  SFX pp       SFX p         1 ,  0 , -1 ,  NULL ,
    118 ,  SFX rr       SFX r         1 ,  0 , -1 ,  NULL ,
    119 ,  SFX tt       SFX t         1 ,  0 , -1 ,  NULL ,
    120 ,  SFX ww       SFX w         1 ,  0 , -1 ,  NULL ,
    121 ,  SFX xx       SFX x         1 ,  0 , -1 ,  NULL ,
    122 ,  LAMBDA       SFX e        -1 ,  0 , -1 ,  ' Add-an-E ,
    000 ,

CREATE Step1C-Rules
    123 ,  SFX y        SFX i         0 ,  0 , -1 ,  ' Contains-Vowel ,
    000 ,

CREATE Step2-Rules
    203 ,  SFX ational  SFX ate       6 ,  2 ,  0 ,  NULL ,
    204 ,  SFX tional   SFX tion      5 ,  3 ,  0 ,  NULL ,
    205 ,  SFX enci     SFX ence      3 ,  3 ,  0 ,  NULL ,
    206 ,  SFX anci     SFX ance      3 ,  3 ,  0 ,  NULL ,
    207 ,  SFX izer     SFX ize       3 ,  2 ,  0 ,  NULL ,
    208 ,  SFX abli     SFX able      3 ,  3 ,  0 ,  NULL ,
    209 ,  SFX alli     SFX al        3 ,  1 ,  0 ,  NULL ,
    210 ,  SFX entli    SFX ent       4 ,  2 ,  0 ,  NULL ,
    211 ,  SFX eli      SFX e         2 ,  0 ,  0 ,  NULL ,
    213 ,  SFX ousli    SFX ous       4 ,  2 ,  0 ,  NULL ,
    214 ,  SFX ization  SFX ize       6 ,  2 ,  0 ,  NULL ,
    215 ,  SFX ation    SFX ate       4 ,  2 ,  0 ,  NULL ,
    216 ,  SFX ator     SFX ate       3 ,  2 ,  0 ,  NULL ,
    217 ,  SFX alism    SFX al        4 ,  1 ,  0 ,  NULL ,
    218 ,  SFX iveness  SFX ive       6 ,  2 ,  0 ,  NULL ,
    219 ,  SFX fulnes   SFX ful       5 ,  2 ,  0 ,  NULL ,
    220 ,  SFX ousness  SFX ous       6 ,  2 ,  0 ,  NULL ,
    221 ,  SFX aliti    SFX al        4 ,  1 ,  0 ,  NULL ,
    222 ,  SFX iviti    SFX ive       4 ,  2 ,  0 ,  NULL ,
    223 ,  SFX biliti   SFX ble       5 ,  2 ,  0 ,  NULL ,
    000 ,

CREATE Step3-Rules
    301 ,  SFX icate    SFX ic        4 ,  1 ,  0 ,  NULL ,
    302 ,  SFX ative    LAMBDA        4 , -1 ,  0 ,  NULL ,
    303 ,  SFX alize    SFX al        4 ,  1 ,  0 ,  NULL ,
    304 ,  SFX iciti    SFX ic        4 ,  1 ,  0 ,  NULL ,
    305 ,  SFX ical     SFX ic        3 ,  1 ,  0 ,  NULL ,
    308 ,  SFX ful      LAMBDA        2 , -1 ,  0 ,  NULL ,
    309 ,  SFX ness     LAMBDA        3 , -1 ,  0 ,  NULL ,
    000 ,

CREATE Step4-Rules
    401 ,  SFX al       LAMBDA        1 , -1 ,  1 ,  NULL ,
    402 ,  SFX ance     LAMBDA        3 , -1 ,  1 ,  NULL ,
    403 ,  SFX ence     LAMBDA        3 , -1 ,  1 ,  NULL ,
    405 ,  SFX er       LAMBDA        1 , -1 ,  1 ,  NULL ,
    406 ,  SFX ic       LAMBDA        1 , -1 ,  1 ,  NULL ,
    407 ,  SFX able     LAMBDA        3 , -1 ,  1 ,  NULL ,
    408 ,  SFX ible     LAMBDA        3 , -1 ,  1 ,  NULL ,
    409 ,  SFX ant      LAMBDA        2 , -1 ,  1 ,  NULL ,
    410 ,  SFX ement    LAMBDA        4 , -1 ,  1 ,  NULL ,
    411 ,  SFX ment     LAMBDA        3 , -1 ,  1 ,  NULL ,
    412 ,  SFX emt      LAMBDA        2 , -1 ,  1 ,  NULL ,
    423 ,  SFX sion     SFX s         3 ,  0 ,  1 ,  NULL ,
    424 ,  SFX tion     SFX t         3 ,  0 ,  1 ,  NULL ,
    415 ,  SFX ou       LAMBDA        1 , -1 ,  1 ,  NULL ,
    416 ,  SFX ism      LAMBDA        2 , -1 ,  1 ,  NULL ,
    417 ,  SFX ate      LAMBDA        2 , -1 ,  1 ,  NULL ,
    418 ,  SFX iti      LAMBDA        2 , -1 ,  1 ,  NULL ,
    419 ,  SFX ous      LAMBDA        2 , -1 ,  1 ,  NULL ,
    420 ,  SFX ive      LAMBDA        2 , -1 ,  1 ,  NULL ,
    421 ,  SFX ize      LAMBDA        2 , -1 ,  1 ,  NULL ,
    000 ,

CREATE Step5A-Rules
    501 ,  SFX e        LAMBDA        0 , -1 ,  1 ,  NULL ,
    502 ,  SFX e        LAMBDA        0 , -1 , -1 ,  ' Remove-an-E ,
    000 ,

CREATE Step5B-Rules
    503 ,  SFX ll       SFX l         1 ,  0 ,  1 ,  NULL ,
    000 ,


Replace-End ( str len rule -- str len-i rule-id )

Returns: int -- the id for the rule fired, 0 is none is fired

Purpose: Apply a set of rules to replace the suffix of a word

Plan: Loop through the rule set until a match meeting all conditions is found. If a rule fires, return its id, otherwise return 0. Conditions on the length of the root are checked as part of this function's processing because this check is so often made.

Notes: This is the main routine driving the stemmer. It goes through a set of suffix replacement rules looking for a match on the current suffix. When it finds one, if the root of the word is long enough, and it meets whatever other conditions are required, then the suffix is replaced, and the function returns.

Program Text 4
 
: Replace-End        ( str len rule -- str len-i rule-id )
    BEGIN  dup ->Rule-ID @ WHILE
        3dup >R  over SWAP 1- +  R> ->Old-Offset @ -    ( . . . str ending)
        dup >R                                            ( R: ending)
        > NOT IF                                        ( str len rule)
            dup ->Old-End COUNT   R@ over  COMPARE 0= IF
                3dup >R  Word-Size  R> ->Min-Root-Size @  > IF
                    dup ->Condition @ 0=  ORIF
                    3dup ->Condition @ EXECUTE  THEN
                    IF
                        dup ->New-End COUNT R@ SWAP MOVE
                        >R  R@ ->Old-Offset @ -  R@ ->New-Offset @ +
                        R> ->Rule-ID @
                        R> DROP
                    EXIT THEN
                THEN
            THEN
        THEN
        R> DROP
        ->Next-Rule
    REPEAT
    DROP 0 ;


Stem ( str len -- PAD len-i flag )

Returns: int -- FALSE if the word contains non-alphabetic characters and hence is not stemmed, TRUE otherwise

Purpose: Stem a word

Plan: Part 1: Check to ensure the word is all alphabetic Part 2: Run through the Porter algorithm Part 3: Return an indication of successful stemming

Notes: This function implements the Porter stemming algorithm, with a few additions here and there. See:

Porter, M.F., "An Algorithm For Suffix Stripping," Program 14 (3), July 1980, pp. 130-137.

Porter's algorithm is an ad hoc set of rewrite rules with various conditions on rule firing. The terminology of "step 1a" and so on, is taken directly from Porter's article, which unfortunately gives almost no justification for the various steps. Thus this function more or less faithfully refects the opaque presentation in the article. Changes from the article amount to a few additions to the rewrite rules; these are marked in the RuleList data structures with comments.

Program Text 5
 
    : >PAD  ( str len -- PAD len)  >R  PAD R@ MOVE  PAD R> ;

: STEM             ( str len -- PAD len-i flag )

    >PAD
    2dup SCAN[ IS-ALPHA NOT ]SCAN NIP IF  FALSE EXIT  THEN
    2dup STRING>LOWER

    Step1A-Rules Replace-End DROP
    Step1B-Rules Replace-End
    dup 106 = ORIF dup 107 = THEN NIP IF
        Step1B1-Rules Replace-End DROP
    THEN
    Step1C-Rules Replace-End DROP

    Step2-Rules Replace-End DROP

    Step3-Rules Replace-End DROP

    Step4-Rules Replace-End DROP

    Step5A-Rules Replace-End DROP
    Step5B-Rules Replace-End DROP

    TRUE ;



Go back to home page.