This is G o o g l e's cache of http://home.earthlink.net/~neilbawd/dfc.txt. G o o g l e's cache is the snapshot that we took of the page as we crawled The page may have changed since that time. Click here for the current page To link to or bookmark this page, use the following url: http://www.google.com/search?q=cache:5LpLnSieFZIC:home.earthlink.net/~neilb Google is not affiliated with the authors of this page nor responsi \ DFC Differential File Comparison ANEW --DFC-- DECIMAL \ 2002-01-30 Empty file now rejected. \ 2002-01-29 Single line in file has been allowed. \ 2001-10-29 Dependency on ToolBelt has been removed. 0 [IF] ======================================================= <A HREF="dfc.txt"> <SMALL>TEXT</SMALL></A> Wil Baden 2002-01-30 A necessary tool for every programmer is a utility to compare files - particularly source files - to find where and how they are different. A prudent programmer will make a copy of a file before modifying it. In the course of making a series of small modifications to fix a misbehaving application the programmer can easily lose track of just what has been done. Then the file comparison utility can be used to show the changes. This utility can be used to show the differences between released versions as well. To do the job right is not a trivial task. The obvious algorithm will sooner than later fail miserably. The obvious algorithm is to compare lines until a difference is found, then search forward in both files to find where they are the same again. The trick is not to look for differences but to look for the longest common subsequence - the longest set of lines which are the same in both files and in the same order with what's different interspersed. What's left are the differences. How to do this is the subject of Hunt, J. W. and M. D. McIlroy [1976], "An algorithm for differential file comparison," Computing Science Technical Report 41, AT&T Bell Laboratories, Murray Hill, N.J. It is based on Hunt, J.W and T.G. Szymanski, [1977] "A fast algorithm for computing longest common subsequences"," Comm. ACM , vol. 20 no. 5, pp. 350-353. In 1976 I implemented this using my own code in Fortran II for a 8K 16-bit word IBM 1130. It has followed me ever since, becoming re-incarnated on each new platform in whatever the language of the moment was. I even did this in C for Unix because my output format was more useful to me than that of the Unix tool `diff`. Some years ago I did it for MacForth. In the present incarnation it is Standard Forth. Differential file comparison is the foundation of version control systems - SCCS, RCS, SCVS. The algorithm is essentially brute force. Read and save one file, then read records from the other file trying to find with each record a longer common subsequence than you already have. Potentially this could require M * N line comparisons, where M and N are the number of lines in each file. In real life that never happens. The time and memory constraints are still too extravagant. So a really slick trick is used. Instead of comparing whole lines, an integer hash value is computed for each line, and the associated hash values are compared. Making believe that every unique line has a unique hash value, we compute a longest common subsequence. Not until we print do we check whether equal hash values represent identical lines. In 25 years of use this has hardly ever happened. In the very few times it has, the effect has been negligible. (You can tell that it has happened when an insertion appears just before a deletion.) I haven't seen it since 1988. Of course you can force it to happen by using a poor hashing function. However the hashing function doesn't have to be sophisticated. The one used here has worked fine with 32-bit or 16-bit arithmetic. Where I used to work, the Pascal incarnation was used 30 to 200 times a day for ten years, using 16-bit arithmetic. It was used even after the company went to Unix. How to Use old-file-id TO OLDFILE new-file-id TO NEWFILE DFC NEWFILE and OLDFILE may be assigned in either order. You should adapt the file-opening to your environment. Here is an example from John Peters that works on two versions of WinView. S" C:\WIN32FOR\WINVIEW.F" R/O OPEN-FILE DROP TO OLDFILE S" C:\WIN32FORCG\WINVIEW.F" R/O OPEN-FILE DROP TO NEWFILE DFC The following compares an old source for `DFC` with a revision. S" DFC.4TH" R/O OPEN-FILE DROP TO OLDFILE S" DFCNEW.4TH" R/O OPEN-FILE DROP TO NEWFILE DFC The output was: 1 ---- ( DFC - Differential File Comparison. Wil Baden 1976-1996 ) ++++ 1 ( DFC - Differential File Comparison Using HERE Wil Baden ) 2 2 50 37 51 ---- 6000 CONSTANT lcs-space ( The larger the better. ) 52 ---- CREATE LCS lcs-space CELLS ALLOT 53 ---- ++++ 38 0 VALUE lcs-space 0 VALUE LCS 54 39 0 VALUE oldlines 0 VALUE newlines 394 379 ( Differential file comparison. ) ++++ 380 ALIGN HERE TO LCS ++++ 381 UNUSED 1 CELLS - 1+ ALIGNED 1 CELLS / TO LCS-Space 395 382 Read-Newerfile Sort-Hash-Values Mark-Hash-Classes 397 384 Build-Candidate-Table Show-Differences ++++ 385 oldlines newlines - 2 - LCS @ - . ." deletions, " ++++ 386 newlines 1- LCS @ - . ." insertions, " ++++ 387 LCS @ . ." unchanged " CR 398 388 OLDFILE REWIND NEWFILE REWIND This shows that in the old file, DFC.4TH, * Line 1 has been replaced. * Lines 51 through 53 have been replaced by a single line. * A few new lines have been inserted after lines 394 and 397. The numbers in the first column are the line numbers in the first file. The numbers in the second column are the line numbers in the second file. The code has been checked for 16-bit and 32-bit cell size. ------------------------------------------------------- [THEN] 0 [IF] ======================================================= Wil Baden 1976-1996 DFC - Differential File Comparison. Make a line by line comparison of two files, showing where and how they are different. Usage: old-file-id TO OLDFILE new-file-id TO NEWFILE DFC This incarnation dynamically assigns workspace in unused dataspace, giving maximum memory for the data structures. The maximum size of a record, `DFC-MAXLINE`, can be set before running. This lets files with long records be compared. ------------------------------------------------------- [THEN] 0 [IF] ======================================================= Common Functions ------------------------------------------------------- [THEN] \ Comment out definitions that you already have. \ Exchange 0<> with 0= to comment them all out. TRUE 0<> [IF] : BOUNDS ( addr len -- addr+len addr ) over + SWAP ; : IS-WHITE ( char -- flag ) 33 - 0< ; : NOT ( x -- flag ) 0= ; \ : OFF ( addr -- ) FALSE SWAP ! ; \ : ON ( addr -- ) TRUE SWAP ! ; : \\ ( "...<eof>" -- ) BEGIN -1 PARSE 2DROP REFILL 0= UNTIL ; [THEN] 0 [IF] ======================================================= Application Values and Variables OLDFILE ( -- file-id ) Value for file-ID of "old" file. To be set by user. NEWFILE ( -- file-id ) Value for file-ID of "newer" file. To be set by user. DFC-MAXLINE ( -- n ) Value of maximum size of line for DFC comparisons. Initially 255, but may be extended before running `DFC`. DFC-RIGHT-MARGIN ( -- n ) Value of the right-hand margin for automatically wrapping output lines. Set this to a convenient size for you. DFC-COLLATE ( -- addr ) Variable should be set ON to collate instead of showing differences. REWIND ( file-id -- ) Rewind file. ------------------------------------------------------- [THEN] 0 VALUE OLDFILE 0 VALUE NEWFILE 255 VALUE DFC-MAXLINE 72 VALUE DFC-RIGHT-MARGIN VARIABLE DFC-COLLATE DFC-COLLATE OFF : REWIND ( fid -- ) 0 0 ROT REPOSITION-FILE ABORT" Can't REWIND " ; 0 [IF] ======================================================= Implementation ------------------------------------------------------- [THEN] VOCABULARY Differential-File-Compare : INTERFACE ( -- ) GET-ORDER >R over SET-CURRENT R> SET-ORDER ; ALSO Differential-File-Compare DEFINITIONS 0 VALUE DFC-Space \ Will be calculated in unused dataspace. 0 VALUE &OLDTEXT 0 VALUE &NEWTEXT 0 VALUE &MATCHINGTEXT 0 VALUE &Cleaned-Oldtext 0 VALUE &Cleaned-Newtext 0 [IF] ======================================================= PADDING ( -- n ) Environmentally dependent value to find free space in unused data space. 1000 should be more than enough. If `PAD` is defined as usual in data space, `PADDING` can be `PAD HERE -`. [For Power MacForth, 136 is enough.] LCS Cell for each record + 3 * matching-candidates. Thus 6000 cells takes care of files up to at least 1200 lines. Cell-Place ( c_addr len addr -- ) Cell version of `PLACE`. Cell-Count ( addr -- c_addr len ) Cell version of `COUNT`. ------------------------------------------------------- [THEN] \ 1000 CONSTANT PADDING 136 CONSTANT PADDING 0 VALUE LCS : Cell-Place ( c_addr len addr -- ) 2dup 2>R CELL+ SWAP chars MOVE 2R> ! ; : Cell-Count ( addr -- c_addr len ) dup CELL+ SWAP @ -TRAILING ; VARIABLE Skipping : Clean-Line ( c_addr len -- c_addr len' ) ( Remove fairy characters. ) Skipping OFF >R 0 over R> ( c_addr len' c_addr len ) chars BOUNDS ?DO ( c_addr len') I C@ IS-WHITE IF Skipping ON ELSE Skipping @ IF 2dup chars + BL SWAP C! 1+ Skipping OFF THEN 2dup chars + I C@ SWAP C! 1+ THEN LOOP ; 131 CONSTANT HASH-FACTOR : HASH ( c_addr len -- hash-value ) ( Compute hash value for a string. ) Skipping OFF 0 ROT ROT chars BOUNDS ?DO ( hash-value) I C@ IS-WHITE IF Skipping ON ELSE Skipping @ IF HASH-FACTOR * BL + Skipping OFF THEN HASH-FACTOR * I C@ + THEN LOOP ; : Read-Text ( buffer fileid -- flag ) >R dup CELL+ DFC-MAXLINE R> READ-LINE ABORT" Can't read. " SWAP ROT ! ; : Write-Text ( buffer -- ) Cell-Count ( addr len) BEGIN dup DFC-RIGHT-MARGIN > WHILE over DFC-RIGHT-MARGIN ( . . addr2 len2) BEGIN dup WHILE 2dup chars + C@ IS-WHITE NOT WHILE 1- REPEAT ELSE DROP DFC-RIGHT-MARGIN THEN TUCK TYPE ( str len len2) 1+ /STRING ( str len) CR 10 SPACES REPEAT TYPE CR ; 0 [IF] ======================================================= NEWLINES ( -- n ) 1 + lines in newer file. OLDLINES ( -- n ) 1 + lines in old file + 1 + lines in newer file. CAND ( -- addr ) Next candidate. LCS In `Find-Longest-Common-Subsequence`, pointer to candidate. In `Show-Differences`, number of matched lines. X Generally, working variable. In `Show-Differences`, old line number. Y Generally, working variable. In `Show-Differences`, new line number. SLOT ( n -- addr ) Address of _n_'th item in working memory from the bottom. Has the record number of a line. SLOT-H ( n -- addr ) Address of _n_'th item in working memory from the top. Used for the hash value of a line. The memory for this is