Author: M.L.Gassanenko
(c) M.L.Gassanenko, 1996-98
 

C-Style Indexing in ANS Forth

M.L.Gassanenko
mlg@forth.org

History:
ver. 1 Published as "C-Style Arrays in Forth" in Forth Dimensions, Volume XVIII, Number 4, November-December 1996, p.15-21.
ver. 2.1 Minor improvements. I underline that this paper introduces based indexed addressing to Forth rather than an approach to arrays. (19 Dec 1998)
ver. 2.2 Added a reference to discussion (17 May 2000)

To download all files in PKZIP format: CStyleIn.zip
Listing One
Listing Two
Listing Three
Listing Four
Appendix: Exercises for the Novice
Discussion (answers to questions asked via e-mail)

Abstract

This paper proposes a C-like indexing for Forth. The notation may be used for indexing cell arrays (C-style zero-based); it also can support multi-dimensional arrays. A similar syntax may be used for bit or double-cell arrays, or arrays of structures. Although the idea is not new, the notation seems to be felicitous and might be included into the next standard. The paper also shows how analysis of possible name conflicts should be performed.

Introduction

One of the (not too) modern processors features that is rarely utilized by (even modern) Forth is based indexed addressing. There have been several approaches to array accessing but so far none of them has been considered enough felicitous to be included into the standard.

Array elements are usually accessed via 2* + @ (unless programmer prefers to resort to assembler). There is also a traditional array implementation in which the operation of indexing is bound with and hidden in the array name (an array is a function that takes indexes from the stack and leaves the address of element), but it has not become a de-facto standard. The third way, also no de-facto standard, uses words like []CELL and the only difference of the proposed syntax from it is in better naming. This paper shows how one can implement multi-dimensional arrays using this idea.

It may appear that this paper is about arrays in Forth, but it is not exactly so. First of all, this paper introduces based indexed addressing, and this based indexed addressing suggests an approach to arrays.

The proposed syntax for the indexed access operations was inspired by (almost borrowed from) C and Algol-68. One is only sorry that this syntax did not appear 15 years ago.

Specifications

[]      ( n a-addr -- x )                "brackets"                EXPERIMENTAL
x is the value stored into the n-th cell of the cell array starting at a-addr. The array cells are numbered starting from zero. Semantically equivalent to SWAP CELLS + @
[]!     ( x n a-addr -- )                "brackets-store"      EXPERIMENTAL
Store the value x into the n-th cell of the cell array starting at a-addr. The array cells are numbered starting from zero. Semantically equivalent to SWAP CELLS + !
[]^     ( n a-addr1 -- a-addr2 ) "brackets-pointer"      EXPERIMENTAL
Add the size in address units of n cells to a-addr1, giving a-addr2. Semantically equivalent to SWAP CELLS +
Note. This is a proposal for the next Forth standard.
 

Implementation

What follows is an F-PC implementation of these words:

CODE [] ( index array -- value )
     pop bx
     pop di
     shl di
     push 0 [bx+di]
     next c;

CODE []! ( value index array -- )
     pop bx
     pop di
     shl di
     pop 0 [bx+di]
     next c;

CODE []^ ( index array -- address )
     pop bx
     pop di
     shl di
     add bx, di
     push bx
     next c;

On a 386 Forth that uses in-lining and keeps the data stack top in EBX the code substituted for [] may look like this:

        POP EAX
        MOV EBX, [EBX] [4*EAX]

which is much better than

        XCHG EBX, [ESP]         \ SWAP
        SHL  EBX, # 2           \ CELLS
        POP  EAX                \ +
        ADD  EBX, EAX           \ +, continued
        MOV  EBX, [EBX]         \ @

that we would have as the result of substituting SWAP CELLS + @ in line.
 

Multi-Dimensional Arrays

This section shows that multi-dimensional arrays can equally be implemented this way.

A multi-dimensional array is implemented as an array of arrays. Fetching, storing, and pointing to a two-dimensional array element look like this:

    j i X [] []     \ X[i][j]
... j i X [] []!    \ X[i][j] = ...
    j i X [] []^    \ &X[i][j]

Here X is an array containing addresses of arrays of cells. These cell arrays should not necessarily be of the same size. No index range checking is performed, though. The word ARRAY, creates an array of n elements x0 ... x[n-1] and returns its address:

: array, ( x0 x1 ... x[n-1] n -- addr )
   ( align )  here >r
   ?dup
   if
         0 swap 1-
         do
                  i roll ,
        -1 +loop
   then
   r>
;

Although usefulness of this word is restricted by the maximal stack depth, this word enables us to create arrays of (sub)arrays, and these subarrays may have different lengths.

Once the subarrays may have different lengths, we may wish to be able to determine them. Provided that all the subarrays are created by ARRAY, immediately one after another, and that the last top-level array element is followed by a "dummy" pointer, the word []LEN ( i a -- l ) given below returns the length of the i-th subarray of the top-level array a.

code []len ( 1index array -- 2length )
     pop bx
     pop di
     shl di
     mov ax, 2 [bx+di]       \ address of the (i+1)-th subarray
     sub ax, 0 [bx+di]       \ minus address of the i-th subarray
     shr ax                  \ gives i-th subarray length in cells
     push ax
     next c;

The "dummy" pointer is the address of the cell that follows the last array element. We may consider it as a zero-length subarray. If it is missed, the word []LEN will not calculate the length of the last subarray.
 

Assessing the Multi-Dimensional Array Implementation

Here we compare the array-of-arrays implementation with the more traditional one where index calculation involves multiplication by the number of columns.

The array-of-arrays implementation does not necessarily require more memory. For example, if we manipulate matrixes of a special form, say, symmetric, this technique almost halves the amount of memory required.

This implementation also works faster because even on a 486 multiplication is slower than memory fetch. Probably this consideration is not of too much importance, though.
 

Some Examples

These examples scarcely need comments. The screen output is shown at the Listing Two.

: .ARRAY ( array len -- )
   ." [ "
   0 ?DO   I OVER [] .   LOOP
   ." ] " DROP
;
: .2ARRAY ( array n_rows -- )
   CR ." [ "
   0 ?DO   I OVER []  I PLUCK []LEN  CR .ARRAY   LOOP
   ." ] " DROP
;

\ The last, delimiter, string of array is needed for []len
\ to calculate the length properly
1  2  3  4  5   5 array,
6  7  8  9      4 array,
10 11 12        3 array,
                0 array,
                          4 array, constant x
0 0 x [] [] .
2 0 x [] [] .
25 1 2 x [] []!
1 2 x [] [] .
1 2 x [] []^ @ .
0 x []len .
2 x []len .
x 3 .2array
 
 

Bit, Double-Cell and Other Arrays

A similar syntax may be used to handle bit arrays.

BIT[]   ( u addr -- b )          "bit-brackets"          EXPERIMENTAL

b is the value stored into the n-th bit of the bit array starting at addr. The array bits are numbered starting from zero. The most significant bit has the largest number. The number of bits in an address unit is system dependent.
BIT[]!  ( x u a-addr -- )        "bit-brackets-store"    EXPERIMENTAL
Store the low bit of x into the n-th bit of the bit array starting at addr.
BITS    ( u1 -- u2 )             "bits"                  EXPERIMENTAL
u2 is the minimal size in address units of a memory area that contains at least n bits.
The specifications for the double-cell indexing words are evident:

D[]      ( n a-addr -- d )
D[]!     ( d n a-addr -- )
D[]^     ( n a-addr1 -- a-addr2 )

The number n above is the number of the two-cell array element which we want to access.

In systems with 16-bit characters it may be desirable to have character array operations as well. Practice has showed that it is convenient to use the words C[] and C[]! to access TIB (this looks as TIB C[] instead of TIB + C@ or CHARS TIB + C@ ) and to work with character strings. The stack effect specifications for these words are given below:

C[]      ( n c-addr -- d )
C[]!     ( c n c-addr -- )
C[]^     ( n c-addr1 -- c-addr2 )

Implementation of all these words is a good exercise for a novice, studying either Forth or Assembler. A good name for a double-cell []LEN analog is D[]LEN .
 

Consistency

In this section we ascertain that introduction of the proposed new names into the standard does not lead to naming problems.

We have introduced two new language elements: [] "indexing" and ^ "address". (A sequence (or a set) of characters used in a name and having a meaning for programers we call a language element; for example, the name CHAR+ consists of two language elements: CHAR and + ).

The symbol ^ has not been used in the standard before. There is an old tradition to indicate "address" by ' (tick), but in the modern standard ' (tick) means only "execution token".

The symbol [] has not yes been used, but [ and ] usually denote state-switching or immediacy. This does not lead to naming conflicts because [ and ] have never been used one immediately after another. The only imaginable candidate on the [] name is

: [] ' EXECUTE ; IMMEDIATE

which we can name [EX] or [EXEC] if we will need it.

(To illustrate importance of such analysis we can give the following example: in an ANS Forth system we can define

4 CONSTANT CELL
: CELLS CELL * ;
: #CELLS CELL + 1- CELL / ;

but we cannot define

1 CONSTANT CHAR
: CHARS CHAR * ;
: #CHARS CHAR + 1- CHAR / ;

and get a standard system because according to the standard CHAR means "obtain a character from the input stream" while CHARS means "multiply by the size of a character").

The name of the word [] that may be used to fetch a data address contains no mention of the data size. This makes the notation more natural: [] [] or [] D[] looks better than []CELL []CELL or []CELL []DOUBLE because the size of data is mentioned at most once, and
this is the size of data that we want to access. The "size specifiers" ("D" and "BIT" in D[] and BIT[] ) are placed before the "operation specifier" [] to keep the same style as in other such words, e.g. C@ C! CELL+ . Again, the "CELL" address specifier is omitted because the
size of the stack element is the default for operations that move data between stack and memory ( @ and ! work with one cell, but they are not named CELL@ and CELL! ).
 
 

Optional Range Checking

The most evident approach to this is to store the array length into the -1st cell and to redefine the array words to use it to perform range checking. This is shown in Listings Three and Four. Remember that such redefinition enables range checking only for words that get compiled after it, and already compiled words will still use the "unprotected" version. Although the naming issue is
always up to the programmer's taste, it may be recommended to redefine the array words locally (within a module vocabulary), and to be able to use their original, "unprotected" versions
under some other names.

The problem with range checking is that one may wish to index cells in a data structure that in principle cannot hold the array length — for example, a string of cells read in from a file or generated by the operating system or some other program.

Range checking is good for arrays but it is extraneous for based indexed addressing.
 

More features

The word []^ may be used to change the numbers of array elements. For example,

CREATE hisarray 100 CELLS ALLOT
50 VALUE mybase
: myarray mybase hisarray []^ ;
\ equivalent to: : myarray hisarray mybase CELLS + ;

Now we can access 52-nd element of hisarray in two ways: as

52 hisarray []
or as
2 myarray []
Note that for myarray , negative indexes are meaningful. In addition, we can change the number to mybase , thus changing the mapping.

This re-numbering may be done sevaral times, e.g.:

10 VALUE yourbase
: yourarray yourbase myarray []^ ;

Conclusion

The array indexing words presented here enable Forth to utilize the based indexed addressing present on most processors. The words [] []! []^ might be included into the next standard.

This syntax should have appeared 10-15 years ago.
 
 
 



Go to: discussion (answers to questions asked via e-mail). Why [] and not []@, the syntax for accessing character arrays is C[], etc.