A FORMALIZATION OF RETURN ADDRESSES MANIPULATIONS AND CONTROL TRANSFERS M.L.Gassanenko St.Petersburg, Russia. gml@ag.pu.ru Plan: 1. Motivation and Goals 2. Basic Notions: a) TCF - Threaded Code Fragments TCE - Threaded Code Elements b) Equivalence of TCF's c) Position of TCE's 3. Modularity in calculus construction 4. Notation a) Call, Exit, TCF Addresses, TCEs b) The dual notation and dual nature of a call 5. How the calculus works (examples) 6. More definitions and statements 7. Summary of Results ---- 1 ------ MOTIVATION AND GOALS The fundamental difference between data and return addresses manipulations is that return addresses manipulations may result in control transfers. The return addresses manipulations are proven to be a very powerful technique and are an important part of Forth since they enable us to define both classical and unstandard control structures and to implement new methods of program execution, such as backtracking, self-processing data, etc. Unfortunately, return addresses manipulations today are often considered as a bad style, since they are not present in other languages, not formalised and not included into standards. We need a calculus that provided a formal description of existing (and, probably, yet unknown) techniques that use return addresses manipulations to organize code execution. Examples (for a classical system): : BRANCH R> @ >R ; \ ( IP: dest -- ) ( ip:=dest ) : UNNEST RDROP ; \ ( R: a -- ) ( ip:=a ) a.k.a. EXIT \ Backtracking: : ENTER >R ; \ ( addr -- ) call the code at addr : 1-10 ( -> n n ) ( <- n ) \ look over numbers 1..10 1 BEGIN DUP R@ ENTER 1+ DUP 11 = UNTIL DROP RDROP ; : TEST \ print the numbers from 1 to 10 1-10 . ; More problems: threaded code may contain both code and data. Example: \ LIT fetches a value from the threaded code \ and leaves it on the stack : LIT ( -- x ) ( IP: x -- ) R@ @ R> CELL+ >R ; Such word is compiled before any integer that appears in a colon definition. The definition: : X 5 . ; compiles to: code field LIT 5 . EXIT ΘΝΝΝΝΝΝΝΝΝΝΌ ΘΝΝΝΝΝΌ ΘΝΝΝΝΝΌ ΘΝΝΝΝΝΌ ΘΝΝΝΝΝΌ So, we need a criterion to distinguish between code and data. ---- 2 ------ BASIC NOTIONS a) TCE - Threaded Code Element. Either a compiled token of a word or a data element (those typically follow compiled words that process them). TCEs are of known (probably, different) sizes. TCF - Threaded Code Fragment. A sequence of TCEs, typically is processed by the threaded code interpreter. We cannot predict the length of a TCF. b) Equivalence of TCFs. (functional) DUP SWAP DROP == NOOP OVER SWAP == >R DUP R> c) Position. The difference between code and data is the way of processing. Code is processed only by the code interpreter. Data get processed by other functions and not by the code interpreter. A TCE may be in one of the following 4 positions: Active, if it is code: it gets processed only by the code interpreter. Passive, if it is data: it gets processed by other methods. Mixed, or blended, if is both code and data: it gets processed in both ways (it may be self-modifying code or code considered as data). Indifferent, if it is never processed at all (we may call it garbage). ---- 3 ------ 3. Modularity in calculus construction In our calculus we have to use knowledge about data manipulations. All Forth programmers can easily predict the effects of stack operations. There are also formalisms that do the same strictly, but more complicated and more cumbersome. Such knowledge about data manipulations is not really a part of the calculus, the calculus just uses its results. Therefore we decide that formalism and informal knowledge will be two exchangeable modules; our calculus (3rd module) will use any of them, the one which is more appropriate. It is important that both types of knowledge are correct. This enables us to predict stack effects quickly and easily and have a formal basement when we need it. ---- 4 ------ NOTATION a) Call, Exit, TCF Addresses, TCEs x,y - threaded code fragments (TCFs) ; - denotes EXIT '[ x ] - code that places the address of x onto the data stack. 'r[ y ] - code that places the address of y onto the return stack. [r x '] y - call of x: place the address of y onto return stack and transfer control to x. '[ pre[ y ] x ] - address of x, x is preceded by TCE or TCF y (rarely useful unformation). ,f - a single TCE that contains f The numbers denote code fragments that leave them on the stack, e.g.: 5 - leave 5 on the stack ,5 - a single code element that contains 5 b) The dual notation and dual nature of a call There are two notations for a call: [r x '] y - call x 'r[ y ] x - place the address of y onto return stack and execute x Both notations mean the same. This duality reflects the dual nature of a call: on the one hand, call suspends the caller procedure; on the other, it is a control transfer that places an address onto the return stack. The choice that we have allows us to underline which properties we will use. ---- 5 ------ EXAMPLES (HOW THE CALCULUS WORKS) Example 1: A non-local exit. : A 1 RDROP ; : B 2 A 3 ; We will show that B is equivalent to 2 1 : A == [r 1 RDROP ; '] B == [r 2 A 3 ; '] == (we substitute A) == [r 2 [r 1 RDROP ; '] 3 ; '] == (we change the call notation) == [r 2 'r[ 3 ; ] 1 RDROP ; '] == ( 1 does not affect the return stack) == [r 2 1 'r[ 3 ; ] RDROP ; '] == (an evident cancellation) == [r 2 1 ; '] == ( 2 1 do not affect the return stack) == 2 1 Example 2: Accessing data stored in threaded code. : LIT R@ @ R> CELL+ >R ; We will show that in case of non-modifying code LIT ,5 leaves 5 on the stack: (the symbol >> denotes the rest of code) LIT ,5 >> == [r R@ @ R> CELL+ >R ; '] ,5 >> == == 'r[ ,5 >> ] R@ @ R> CELL+ >R ; == (we "predict the effect" of R@ and R> ) == 'r[ ,5 >> ] @ 'r[ ,5 >> ] CELL+ >R ; == (code is R/O; evident calculations) == 5 '[ pre[ ,5 ] >> ] >R ; == ( ,5 plays no role here) == 5 '[ >> ] >R ; == (it is a control transfer) == 5 >> ---- 6 ------ 6. More definitions and statements Some code fragments ignore the top stack item, i.e. neither affect nor depend on it. It turned out that this property is of fundamental importance. t:S1 -- t ignores the top stack item S1( t ) -- a code fragment that "acts like t but ignores the top stack item". Example: S1( DROP ) == NIP The definition: ForAll N N S1( t ) == t N Theorem of comparison by S1: S1( x ) == S1( y ) <=> x == y These notions have return stack analogs: R1( t ) -- a code fragment that "acts like t but ignores the top RETURN stack item". We write that t:R if t ignores the whole return stack. Its definition: t:R iff t == R1( t ) Theorem of call equivalence: t == [r R1( t ) ; '] (a call of R1( t ) is equivalent to t ) Consequence: t:R => t == [r t ; '] (no problems with inlining t if t ignores the return stack) This theorem shows the problems that arise when we use inline substitution instead of calls, and the ways of their solution (not to inline, inline along with the nesting code, trasform and inline, transform all, etc.) ---- 7 ------ 7. Summary of Results A formalism that allows to analyse control transfers due to code addresses manipulations has been developed. The formalism can distinguish between code and data elements in threaded code, and may be based on either formal or intuitive knowledge about stack effects. The calculus provides a formal basement for an important class of techniques -- the techniques that use return addresses manipulations to organize code execution. The theorem of call equivalence explains when and how a call may be replaced by an inline substitution, which is important for optimising compilers. Since return addresses manipulations are very unusual for non-Forth programmers, the calculus may be useful as a means of: analysis, proof of correctness, equivalent transformations, explanation of techniques based on return addresses manipulations. 8. References [BaN88R] Baranoff S.N., Nozdrunoff N.R. Jazyk Fort i ego realizatsii. - L:Mashinostroenie, 1988. (The Forth Language and Its Implementations, in Russian.) [Cha91] Charlton, Gordon. "FOSM, A FOrth String Matcher". Proc. of the EuroFORML'91 Conf., 1991. [Gas93] Gassanenko M.L. Multi-FCA DOES>: Implementation via Self-Modifying Code. Proc. of the euroFORTH'93 conference, held from October 15th-18th 1993 at the museum of Marianske Lazne (Marienbad), Czech Republic, 8 p. [Gas93R] Gassanenko M.L. Novye sintaksicheskie konstruktsii i be`ktreking dlja jazyka Fort.//Problemy tehnologii programmirovanija - SPb: SPIIRAN, 1991-93, p.148-162. (New Control Structures and Backtracking for the Forth Language, in Russian.) [Gas94] Gassanenko, M.L. BacFORTH: An Approach to New Control Structures. Proc. of the EuroForth'94 conference, 4-6 November 1994, Royal Hotel, Winchester, UK, p.39-41. [Poi91] Poial J. Multiple Stack-effects of Forth Programs. 1991 FORML Conference Proceedings, euroFORML'91 Conference, Oct 11-13, 1991, Marianske Lazne, Czechoslovakia, Forth Interest Group, Inc., Oakland, USA, 1992, 400-406. [Poi93] Poial J. Some Ideas on Formal Specification of Forth Programs. 9th euroFORTH conference on the FORTH programming language and FORTH processors, Oct 15-18, 1993, Marianske Lazne, Czech Republic, 1993, 4 pp. [Poi94] Poial, Jaanus. Forth and Formal Language Theory. Proc. of the EuroForth'94 conference, 4-6 November 1994, Royal Hotel, Winchester, UK, 6 pp. [Rod90a] Rodriguez, B. A BNF Parser in Forth. ACM SIGForth Newsletter, vol.2, no.2, December 1990, p.13-18. [Rod90b] Rodriguez, Bradford J. Rules Evaluation Through Program Execution. Proc. of the 1990 Rochester Forth Conf., p.123-125. [StK92] Stoddart B., Knaggs P. The (Almost) Complete Theory of Forth Type Inference. Proc. EuroForth Conference 1992. [StK93] Stoddart B., Knaggs P. Type Inference in Stack Based Languages. Formal Aspects of Computing, BCS, 1993, 5, 289-298. [Sto93] Stoddart B. Self Reference and Type Inference in Forth. 9th euroFORTH conference on the FORTH programming language and FORTH processors, Oct 15-18, 1993, Marianske Lazne, Czech Republic, 1993, 9 pp. [Tuz88R] Tuzov V.A. Funktsional'nye metody programmirovanija./ Instrumentalnye sredstva programmirovanija - L:LIIAN, 1988. - p.129-143. (The Functional Methods of Programmming, in Russian) [Tuz90R] Tuzov V.A. Jazyki predstavlenija znanij. - L:LGU, 1990. (The Languages of Knowledge Representation [what an ideal language of knowledge representation should be], in Russian.)