History:
ver. 1 Published as "C-Style Arrays in Forth" in Forth Dimensions,
Volume XVIII, Number 4, November-December 1996, p.15-21.
ver. 2.1 Minor improvements. I underline that this paper introduces
based
indexed addressing to Forth rather than an approach to arrays. (19
Dec 1998)
ver. 2.2 Added a reference to discussion (17 May 2000)
To download all files in PKZIP format: CStyleIn.zip
Listing One
Listing Two
Listing Three
Listing Four
Appendix: Exercises for the Novice
Discussion (answers to questions asked via e-mail)
Array elements are usually accessed via 2* + @ (unless programmer prefers to resort to assembler). There is also a traditional array implementation in which the operation of indexing is bound with and hidden in the array name (an array is a function that takes indexes from the stack and leaves the address of element), but it has not become a de-facto standard. The third way, also no de-facto standard, uses words like []CELL and the only difference of the proposed syntax from it is in better naming. This paper shows how one can implement multi-dimensional arrays using this idea.
It may appear that this paper is about arrays in Forth, but it is not exactly so. First of all, this paper introduces based indexed addressing, and this based indexed addressing suggests an approach to arrays.
The proposed syntax for the indexed access operations was inspired by (almost borrowed from) C and Algol-68. One is only sorry that this syntax did not appear 15 years ago.
x is the value stored into the n-th cell of the cell array starting at a-addr. The array cells are numbered starting from zero. Semantically equivalent to SWAP CELLS + @[]! ( x n a-addr -- ) "brackets-store" EXPERIMENTAL
Store the value x into the n-th cell of the cell array starting at a-addr. The array cells are numbered starting from zero. Semantically equivalent to SWAP CELLS + ![]^ ( n a-addr1 -- a-addr2 ) "brackets-pointer" EXPERIMENTAL
Add the size in address units of n cells to a-addr1, giving a-addr2. Semantically equivalent to SWAP CELLS +Note. This is a proposal for the next Forth standard.
CODE [] ( index array -- value )
pop bx
pop di
shl di
push 0 [bx+di]
next c;
CODE []! ( value index array -- )
pop bx
pop di
shl di
pop 0 [bx+di]
next c;
CODE []^ ( index array -- address )
pop bx
pop di
shl di
add bx, di
push bx
next c;
On a 386 Forth that uses in-lining and keeps the data stack top in EBX the code substituted for [] may look like this:
POP EAX
MOV EBX, [EBX] [4*EAX]
which is much better than
XCHG EBX, [ESP]
\ SWAP
SHL EBX, # 2
\ CELLS
POP EAX
\ +
ADD EBX, EAX
\ +, continued
MOV EBX, [EBX]
\ @
that we would have as the result of substituting SWAP CELLS + @
in
line.
A multi-dimensional array is implemented as an array of arrays. Fetching, storing, and pointing to a two-dimensional array element look like this:
j i X [] [] \ X[i][j]
... j i X [] []! \ X[i][j] = ...
j i X [] []^ \ &X[i][j]
Here X is an array containing addresses of arrays of cells. These cell arrays should not necessarily be of the same size. No index range checking is performed, though. The word ARRAY, creates an array of n elements x0 ... x[n-1] and returns its address:
: array, ( x0 x1 ... x[n-1] n -- addr )
( align ) here >r
?dup
if
0 swap 1-
do
i roll ,
-1 +loop
then
r>
;
Although usefulness of this word is restricted by the maximal stack depth, this word enables us to create arrays of (sub)arrays, and these subarrays may have different lengths.
Once the subarrays may have different lengths, we may wish to be able to determine them. Provided that all the subarrays are created by ARRAY, immediately one after another, and that the last top-level array element is followed by a "dummy" pointer, the word []LEN ( i a -- l ) given below returns the length of the i-th subarray of the top-level array a.
code []len ( 1index array -- 2length )
pop bx
pop di
shl di
mov ax, 2 [bx+di]
\ address of the (i+1)-th subarray
sub ax, 0 [bx+di]
\ minus address of the i-th subarray
shr ax
\ gives i-th subarray length in cells
push ax
next c;
The "dummy" pointer is the address of the cell that follows the last
array element. We may consider it as a zero-length subarray. If it is missed,
the word []LEN will not calculate the length of the last subarray.
The array-of-arrays implementation does not necessarily require more memory. For example, if we manipulate matrixes of a special form, say, symmetric, this technique almost halves the amount of memory required.
This implementation also works faster because even on a 486 multiplication
is slower than memory fetch. Probably this consideration is not of too
much importance, though.
: .ARRAY ( array len -- )
." [ "
0 ?DO I OVER [] . LOOP
." ] " DROP
;
: .2ARRAY ( array n_rows -- )
CR ." [ "
0 ?DO I OVER [] I PLUCK []LEN
CR .ARRAY LOOP
." ] " DROP
;
\ The last, delimiter, string of array is needed for []len
\ to calculate the length properly
1 2 3 4 5 5 array,
6 7 8 9 4 array,
10 11 12 3 array,
0 array,
4 array, constant x
0 0 x [] [] .
2 0 x [] [] .
25 1 2 x [] []!
1 2 x [] [] .
1 2 x [] []^ @ .
0 x []len .
2 x []len .
x 3 .2array
BIT[] ( u addr -- b ) "bit-brackets" EXPERIMENTAL
b is the value stored into the n-th bit of the bit array starting at addr. The array bits are numbered starting from zero. The most significant bit has the largest number. The number of bits in an address unit is system dependent.BIT[]! ( x u a-addr -- ) "bit-brackets-store" EXPERIMENTAL
Store the low bit of x into the n-th bit of the bit array starting at addr.BITS ( u1 -- u2 ) "bits" EXPERIMENTAL
u2 is the minimal size in address units of a memory area that contains at least n bits.The specifications for the double-cell indexing words are evident:
D[] ( n a-addr -- d )
D[]! ( d n a-addr -- )
D[]^ ( n a-addr1 -- a-addr2 )
The number n above is the number of the two-cell array element which we want to access.
In systems with 16-bit characters it may be desirable to have character array operations as well. Practice has showed that it is convenient to use the words C[] and C[]! to access TIB (this looks as TIB C[] instead of TIB + C@ or CHARS TIB + C@ ) and to work with character strings. The stack effect specifications for these words are given below:
C[] ( n c-addr -- d )
C[]! ( c n c-addr -- )
C[]^ ( n c-addr1 -- c-addr2 )
Implementation of all these words is a good exercise for a novice, studying
either Forth or Assembler. A good name for a double-cell []LEN analog
is D[]LEN .
We have introduced two new language elements: [] "indexing" and ^ "address". (A sequence (or a set) of characters used in a name and having a meaning for programers we call a language element; for example, the name CHAR+ consists of two language elements: CHAR and + ).
The symbol ^ has not been used in the standard before. There is an old tradition to indicate "address" by ' (tick), but in the modern standard ' (tick) means only "execution token".
The symbol [] has not yes been used, but [ and ] usually denote state-switching or immediacy. This does not lead to naming conflicts because [ and ] have never been used one immediately after another. The only imaginable candidate on the [] name is
: [] ' EXECUTE ; IMMEDIATE
which we can name [EX] or [EXEC] if we will need it.
(To illustrate importance of such analysis we can give the following example: in an ANS Forth system we can define
4 CONSTANT CELL
: CELLS CELL * ;
: #CELLS CELL + 1- CELL / ;
but we cannot define
1 CONSTANT CHAR
: CHARS CHAR * ;
: #CHARS CHAR + 1- CHAR / ;
and get a standard system because according to the standard CHAR means "obtain a character from the input stream" while CHARS means "multiply by the size of a character").
The name of the word [] that may be used to fetch a data address
contains no mention of the data size. This makes the notation more natural:
[] [] or [] D[] looks better than []CELL []CELL or
[]CELL
[]DOUBLE because the size of data is mentioned at most once, and
this is the size of data that we want to access. The "size specifiers"
("D" and "BIT" in D[] and BIT[] ) are
placed before the "operation specifier" [] to keep the same style
as in other such words, e.g. C@ C! CELL+ . Again, the "CELL"
address specifier is omitted because the
size of the stack element is the default for operations that move data
between stack and memory ( @ and ! work with one cell,
but they are not named CELL@ and CELL! ).
The problem with range checking is that one may wish to index cells in a data structure that in principle cannot hold the array length — for example, a string of cells read in from a file or generated by the operating system or some other program.
Range checking is good for arrays but it is extraneous for based indexed
addressing.
CREATE hisarray 100 CELLS ALLOT
50 VALUE mybase
: myarray mybase hisarray []^ ;
\ equivalent to: : myarray hisarray mybase CELLS + ;
Now we can access 52-nd element of hisarray in two ways: as
52 hisarray []or as
2 myarray []Note that for myarray , negative indexes are meaningful. In addition, we can change the number to mybase , thus changing the mapping.
This re-numbering may be done sevaral times, e.g.:
10 VALUE yourbase
: yourarray yourbase myarray []^ ;
This syntax should have appeared 10-15 years ago.