Subject: Re: Match.fs Date: Sun, 20 Aug 2000 13:53:48 GMT From: ecbm@cc.newcastle.edu.au (Bruce McFarling) Organization: The University of Newcastle Newsgroups: comp.lang.forth \ match.fs MATCH.0.0.0 \ NB: pre-alpha. \ Consider this a work in progress. Each word has been tested \ in some context and at some stage in its development. No \ words have been tested exhaustively, and the most recent \ renaming of words for public posting may well have broken \ some code that was more or less working previously. \ This is made public in order to encourage comment and \ rebuttal. It is not a release. 0 [IF] ======================================================== A Forth Pattern Matching Philosophy The general philosophy of General Regular Expression pattern matching is that the person's job is to "express" a pattern, and the computer's job is to determine which patterns match that expressions. My experience in using general regular expressions is that this is not what I normally *want* to do. What I want to do is to instruct the computer how to find the pattern I am looking for. Therefore, I first have to perform the task of translating my intentions into a general expressions of a pattern, so that the computer can translate it into something that is supposed to correspond to my original intention. The alternative envisioned here is to combine the general flexibility of the BNF parser due to Brad Rodriguez with a pattern operator approach similar to the SCAN and SKIP operators for character matching. Secondarily, the alternative prusued here may demonstrate that no arcane return stack manipulations are required in order to support backtracking, if a stack picture can be created which makes backtracking relatively straightforward. -------------------------------------------------------- [THEN] 0 [IF] ======================================================== This is the beginning of a toolkit for defining pattern matching words. The fundamental idea is borrowed from Brad Rodriguez' "BNF in Forth". This is to permit automatic concatenation of pattern matching words by maintaining a match state. Brad maintains the match state in a state variable, while I maintain it on the top of stack. This is because of a higher priority to provide for generic reentrancy in a general string pattern matching toolkit, compared to a BNF parser. Maintaining the entire state of the current match on the stack provides this reentrancy. The state is maintained by a "slider" on the stack. The slider is created from a string address and count on stack by the word "/[", and after pattern matching is complete the results are extracted by the words "]/", "]\", "]\/", and "]/\". A simple pattern matching looks like: S" 123test" /[ /digits/ ]/\ cr type cr type with a result of 123 test ok The meanings of the four terminators are: /[ pattern ]\ return matching string, if any /[ pattern ]/ return string following match /[ pattern ]/\ return both, with match on top /[ pattern ]\/ return both, with match below The current implementation of sliders, involving three character addresses and a flag, is discussed below. However, these five words provide a generic slider environment. By rewriting low level slider primitive definitions, an entirely compatible implementation could be provided based upon a different implementation of sliders. Within the slider environment, patterns can be executed in sequence, providing implicit concatenation of patterns. The explicit definition of alternative patterns is provided by the slider operator "||". This resets the slider if the pattern has failed, so that the following pattern may be attempted from the beginning. If the pattern has succeeded, this sets the slider as successfully completed, so that all further attempts at pattern matching are halted. The fundamental slider capability is selective execution or omission of pattern matching operations based upon the state of the slider. This is encapsulated by the pattern-matching definition words "/:" and ";/". The source for a word defined using these tools will have the form of: /: /name/ /pattern1/ /pattern2/ /pattern3/ ;/ There is no special requirement for patterns to be named with a leading and trailing '/' character, but this is encouraged. The notation is used to denote whether a well formed slider is on the stack on entering the word, and on leaving the word. A pattern defined using these words will automatically respect on entry the three possible states of the slider, and will only proceed if a match is in progress that has not yet failed. If the match fails within the word, the slider will also automatically be reset to the location it has when entering the word. When a word is defined using these specialised compilation words, the stack comment may be omitted, as the words typically require a stack picture of ( slider -- slider' ). Basic character matching is provided by the defining word "TOKEN", which defines a word to test a slider against a specific character. Additional words permit more general character tests to be performed. A pattern named with leading and trailing "'" denotes a single character match. 'ANY' Matches any single character /proceed? ( slider -- slider flag ) Generates a flag to determine whether or not a match is currently proceeding /more?/ the match fails if no more characters are available /GET ( slider -- slider c ) obtain the next character =/ ( slider c -- slider' ) match against the character on stack ?/ ( slider flag -- slider' ) update the slider based on the state of a flag TOKEN \ ( c ) TOKEN name defined a word to match against a specific character More complex patterns may be defined using specialised pattern compiling tools. The ones being trialed at present include the following (note that "/pattern/" may be a sequence of patterns, but should not include inclusive concatenation operators "|" or "||"): /pattern_1/ | /pattern_2/ | ... | /pattern_n/ If the match is proceeding, exit the word with a match If the match has failed, reset the string and start over /MAYBE /pattern/ MAYBE/ Match zero or one instances of /pattern/ /ANY /pattern/ ANY/ Match zero or more consecutive instances of /pattern/ /SCAN /pattern/ HEAD/ Match the first instance of /pattern/ in the remaining string. /SCAN /pattern/ TAIL/ Match the string following the first instance of /pattern/ in the remaining string. /RECURSE/ Repeat the current pattern matching word with the remaining string. -------------------------------------------------------- [THEN] 0 [IF] ======================================================== SLIDER IMPLEMENTATION In this , consisting of a four cell stack picture: ( ca_0 ca_n ca_i flag ) or ( slider ) where "ca_0" is the start of the string, ca_n is the character address ONE BEYOND the end of the string, ca_i is the current character address, and "flag" can have one of three states: a successful match in progress, a failed match in progress, and a completed successful match. Whenever a match in progress has failed or a successful match has been completed, the pattern matching words fall through without executing. The reason for the form of the slider is that it permits saving a state of the string in a single character address: if the current address "ca_i" is tucked under ca_n, the result is a slider beginning at the current position, with the original beginning of the string stored beneath: ( ca_0 ca_i ca_n ca_i flag ) or ( ca_0 slider' ) However, once the most primitive level of slider words have been definined, the pattern matching words work with the slider as a basic unit, so that an alternative form could be used. For example, an earlier version of this used a slider consisting of: ( ca_0 u index flag ) where the index runs from 0 to u. -------------------------------------------------------- [THEN] ONLY FORTH DEFINITIONS MARKER *MATCH* VOCABULARY MATCH ALSO MATCH DEFINITIONS 1 CHARS CONSTANT 1CHAR \ translate an address unit count to a character count \ note that the original value should be generated by operations \ on character pointers, so that ( au ) 1CHAR MOD is 0 : >CHARS ( au -- u ) [ 1CHAR 1 = ] [IF] ; [ELSE] [ 1CHAR 2 = ] [IF] 2/ ; [ELSE] [ 1CHAR 4 = ] [IF] 2/ 2/ ; [ELSE] 1CHAR / ; [THEN] [THEN] [THEN] \ * converts a pair of *character* pointers to a string : SPAN>S ( ca0 caN -- ca u ) OVER - >CHARS ; : NOT ( flag -- ~flag ) 0= ; : 4DROP ( a b c d -- ) 2DROP 2DROP ; : 4DUP ( a b c d -- a b c d a b c d ) 2OVER 2OVER ; \ lower level words used in definitions below \ flags giving various states 1 CONSTANT stall : /proceed? ( slider -- slider flag ) DUP TRUE = ; : /stall? ( slider -- slider flag ) DUP stall = ; : /halted? ( slider -- slider flag ) DUP TRUE <> ; : /fail? ( slider -- slider flag ) DUP 0= ; \ If we invert a slider state, we want to leave stall untouched : /not/ ( slider -- slider' ) /stall? NOT IF NOT THEN ; \ Three character addresses giving two spans in succession \ base to end of match, end of match to end of string : /result ( ca_0 ca_n ca_i -- ca_0 ca_i ca_n | ca_0 ca_0 ca_n ) ROT >R 0= IF DROP DUP THEN R> ; \ used to create a temporary marker : /track/ ( slider -- ca slider ) >R TUCK R> ; \ used when match completed to discard temporary marker \ and set match flag at success or failure : update/ ( ca_0 ca_i ca_n ca_i' -- slider' ) ROT DROP TRUE ; : backup/ ( ca_0 ca_i ca_n ca_i' -- slider' ) DROP SWAP FALSE ; : reset/ ( ca_0 ca_i ca_n ca_i' -- slider' ) DROP SWAP TRUE ; \ used when match fails to retain marker and try again : retry/ ( ca_0 ca_i ca_n ca_i' -- ca_0 slider' ) DROP OVER TRUE ; \ Interpreting words : /[ ( str0 -- ca0 ca_n ca_0 flag ) CHARS OVER + OVER TRUE ; : ]/ ( slider -- str' ) /result ROT DROP SPAN>S ; : ]\ ( str0 slider -- str0' ) /result DROP SPAN>S ; : ]\/ ( slider -- str0 str' ) /result >R DUP >R SPAN>S R> R> SPAN>S ; : ]/\ ( str0 slider -- str0' str' ) ]\/ 2SWAP ; : || ( str0 slider -- str0 slider' ) 0<> IF 1 EXIT THEN \ stall on stall or true DROP OVER TRUE \ restart on false ; \ terminal point matching : /0/ ( slider -- slider' ) /halted? IF EXIT THEN DROP 2DUP = ; \ general character matching : /more?/ ( slider -- slider' ) /halted? IF EXIT THEN DROP 2DUP <> ; \ fetch character and advance along string \ if flag=true and there is a character available : /GET ( slider -- slider' c ) /more?/ /halted? IF FALSE EXIT THEN \ dummy if stalled OVER C@ 2>R CHAR+ 2R> \ succeed, char on top ; \ advance to next character, if there is a character available : 'ANY' ( slider -- slider' ) /more?/ /halted? IF EXIT THEN \ stalled, exit >R CHAR+ R> \ advance ; : =/ ( slider c -- slider' ) >R /more?/ /halted? IF RDROP EXIT THEN SWAP COUNT R> = ROT AND ; : ?/ ( slider flag -- slider' ) OVER stall = IF DROP ELSE AND THEN ; \ ( c ) TOKEN token-name : TOKEN ( c -- ) ( execution: ca u flag -- ca' u' flag' ) CREATE C, DOES> C@ =/ ; HEX CHAR + TOKEN '+' CHAR - TOKEN '-' CHAR * TOKEN '*' CHAR / TOKEN '/' CHAR ( TOKEN '(' CHAR ) TOKEN ')' CHAR ^ TOKEN '^' CHAR . TOKEN '.' CHAR , TOKEN ',' CHAR 0 TOKEN '0' CHAR 1 TOKEN '1' CHAR 2 TOKEN '2' CHAR 3 TOKEN '3' CHAR 4 TOKEN '4' CHAR 5 TOKEN '5' CHAR 6 TOKEN '6' CHAR 7 TOKEN '7' CHAR 8 TOKEN '8' CHAR 9 TOKEN '9' \ basic slider compiling words \ ... /: name .... ;/ : /: ( slider -- str0 slider [compiling] ) : POSTPONE /halted? POSTPONE IF POSTPONE EXIT POSTPONE THEN POSTPONE /track/ ; : ;/ ( -- ) ( exec: ca0 ca_i ca_n ca_i' flag -- ca0 ca_n ca_i|i' flag' [exit] ) POSTPONE 0= POSTPONE IF POSTPONE backup/ POSTPONE EXIT POSTPONE THEN POSTPONE update/ POSTPONE ; ; IMMEDIATE \ exit if success so far, retry if failure so far : | ( slider0 slider -- slider0 slider0 | slider [exit] ) POSTPONE 0<> POSTPONE IF POSTPONE update/ POSTPONE EXIT POSTPONE THEN POSTPONE retry/ ; IMMEDIATE : /SELF/ ( sl -- sl' ) POSTPONE /proceed? POSTPONE IF POSTPONE RECURSE POSTPONE THEN ; IMMEDIATE \ extended pattern matching \ NB: Do not use "|" or || within an extended pattern match : /MAYBE ( -- ) ( exec: slider -- ca0 slider' ) POSTPONE /proceed? POSTPONE IF POSTPONE /track/ ; IMMEDIATE : MAYBE/ ( -- ) ( exec: ca0 slider -- slider' ) POSTPONE IF POSTPONE update/ POSTPONE ELSE POSTPONE reset/ POSTPONE THEN POSTPONE THEN ; IMMEDIATE : /ANY ( -- ) ( exec: slider -- ca0 slider' ) POSTPONE /proceed? POSTPONE IF POSTPONE BEGIN POSTPONE /track/ ; IMMEDIATE : ANY/ ( -- ) ( exec: ca0 slider' -- slider'' ) POSTPONE /more?/ POSTPONE WHILE POSTPONE update/ POSTPONE REPEAT POSTPONE reset/ POSTPONE THEN ; IMMEDIATE : /SCAN ( -- ) POSTPONE /proceed? POSTPONE IF POSTPONE BEGIN POSTPONE /track/ POSTPONE /more?/ POSTPONE WHILE POSTPONE TRUE ; IMMEDIATE : HEAD/ ( -- ) POSTPONE /not/ POSTPONE WHILE POSTPONE reset/ POSTPONE 'ANY' POSTPONE REPEAT POSTPONE reset/ POSTPONE ELSE POSTPONE FALSE POSTPONE THEN POSTPONE THEN ; IMMEDIATE : TAIL/ ( -- ) POSTPONE /not/ POSTPONE WHILE POSTPONE reset/ POSTPONE 'ANY' POSTPONE REPEAT POSTPONE update/ POSTPONE ELSE POSTPONE FALSE POSTPONE THEN POSTPONE THEN ; IMMEDIATE /: /DIGIT/ '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' ;/ /: /DIGITS/ /DIGIT/ /ANY /DIGIT/ ANY/ ;/ /: />>DIGITS/ /SCAN /NUMBER/ TAIL/ ;/ /: />DIGITS/ /SCAN /NUMBER/ HEAD/ ;/ BL TOKEN 'BL' /: /BL/ 'BL' /ANY 'BL' ANY/ ;/ \ Assumes ASCII /: 'WS' /GET BL 1+ < ?/ ;/ /: /WS/ 'WS' /ANY 'WS' ANY/ ;/ 9 TOKEN 'TAB' /: /AWK-WS/ /ANY 'BL' ANY/ 'TAB' | /BL/ ;/ ( ---------- Virtually, Bruce McFarling, Newcastle, ecbm@cc.newcastle.edu.au ) Subject: Re: Match.fs Date: Mon, 21 Aug 2000 00:10:23 GMT From: Neil Bawd Organization: EarthLink Inc. -- http://www.EarthLink.net Newsgroups: comp.lang.forth in 399fe28a.10457408@seagoon.newcastle.edu.au, Bruce McFarling at ecbm@cc.newcastle.edu.au wrote: > \ match.fs MATCH.0.0.0 > [...] 0 [IF] ======================================================= Excellent. The following is an advertisement for "Expand Pattern". It compiles exactly the same as the original matching codes. ------------------------------------------------------- [THEN] ($ CHAR $1 TOKEN '$1' | + - * / ( ) ^ . , 0 1 2 3 4 5 6 7 8 9 $) \ basic slider compiling words \ ... /: name .... ;/ : /: ( slider -- str0 slider [compiling] ) : ($ POSTPONE $1 | /halted? IF EXIT THEN /track/ $) ; : ;/ ( -- ) ( exec: ca0 ca_i ca_n ca_i' flag -- ca0 ca_n ca_i|i' flag' [exit] ) ($ POSTPONE $1 | 0= IF backup/ EXIT THEN update/ ; $) ; \ exit if success so far, retry if failure so far : | ( slider0 slider -- slider0 slider0 | slider [exit] ) ($ POSTPONE $1 | 0<> IF update/ EXIT THEN retry/ $) ; IMMEDIATE : /SELF/ ( sl -- sl' ) ($ POSTPONE $1 | /proceed? IF RECURSE THEN $) ; IMMEDIATE \ extended pattern matching \ NB: Do not use "|" or || within an extended pattern match : /MAYBE ( -- ) ( exec: slider -- ca0 slider' ) ($ POSTPONE $1 | /proceed? IF /track/ $) ; IMMEDIATE : MAYBE/ ( -- ) ( exec: ca0 slider -- slider' ) ($ POSTPONE $1 | IF update/ ELSE reset/ THEN THEN $) ; IMMEDIATE : /ANY ( -- ) ( exec: slider -- ca0 slider' ) ($ POSTPONE $1 | /proceed? IF BEGIN /track/ $) ; IMMEDIATE : ANY/ ( -- ) ( exec: ca0 slider' -- slider'' ) ($ POSTPONE $1 | /more?/ WHILE update/ REPEAT reset/ THEN $) ; IMMEDIATE : /SCAN ( -- ) ($ POSTPONE $1 | /proceed? IF BEGIN /track/ /more?/ WHILE TRUE $) ; IMMEDIATE : HEAD/ ( -- ) ($ POSTPONE $1 | /not/ WHILE reset/ 'ANY' REPEAT reset/ ELSE FALSE THEN THEN $) ; IMMEDIATE : TAIL/ ( -- ) ($ POSTPONE $1 | /not/ WHILE reset/ 'ANY' REPEAT update/ ELSE FALSE THEN THEN $) ; IMMEDIATE ( -- Wil Baden Costa Mesa, California Per neilbawd@earthlink.net ) Subject: Re: Match.fs Date: Tue, 22 Aug 2000 06:11:06 GMT From: ecbm@cc.newcastle.edu.au (Bruce R. McFarling) Organization: The University of Newcastle Newsgroups: comp.lang.forth On Mon, 21 Aug 2000 00:10:23 GMT, Neil Bawd wrote: >The following is an advertisement for "Expand Pattern". Which has been designated a "sick tool" -- and let me assure you that this is a Very Good Thing to be. >It compiles exactly the same as the original matching codes. >------------------------------------------------------- [THEN] Note that: >: /: ( slider -- str0 slider [compiling] ) > : > ($ POSTPONE $1 | > /halted? IF EXIT THEN > /track/ > $) >; Is essentially the " :POSTPONEMENT " which I said a year or two back would be a good thing to have -- except that it provides it in the context of more general functionality, as the " :POSTPONEMENT " would not have replaced CHAR + TOKEN '+' (etc. etc.) with >($ CHAR $1 TOKEN '$1' | > + - * / ( ) ^ . , > 0 1 2 3 4 5 6 7 8 9 >$)