'''How does a diff algorithm work?''' There isn't one true diff algorithm, but several with different characteristics. The basic idea is to find a 'modification script' that will turn Text A into Text B. They use modification operations such as insertion and deletion. Usually, you'd want the minimal number of changes required, but some application also have to weigh operations depending on how much text each one affects, or other factors. With a bit of further modern literature search (such as anyone at the level of a university student and beyond (which of course not everyone is) should be capable), these comprehensive papers are easy to find: * W. Miller & E. W. Myers. A File Comparison Program. _Software -- Practice and Experience_ 15(11), November 1985, pp. 1025-1040. * W. Tichy. The String-to-String Correction Problem with Block Moves. _ACM Transactions on Computer Systems_ 2(4), November 1984, pp. 309-321. * P. Heckel. A technique for Isolating Differences Between Files. _Communications of the ACM_ 21(4), April 1978, pp. 264-268 Some more recent diff-style algorithms: * Xdelta, which beautifully diffs binary files - http://xdelta.org/ * Rsync efficiently diffs binary files over network links - http://samba.org/~tridge/phd_thesis.pdf * bsdiff, which is optimized for compiled executables - http://www.daemonology.net/bsdiff/ The original Unix diff paper was Hunt and Mc''''Ilroy Comp. Sci. Tech. Rep. #41, Bell Telephone Laboratories (1976). There's a copy at http://www.cs.dartmouth.edu/~doug/diff.ps. The missing figures are at http://www.cs.dartmouth.edu/~doug/fig1.jpg and http://www.cs.dartmouth.edu/~doug/fig2.jpg Hunt-Mc''''Ilroy is considerably faster for the line-to-line comparisons than the "folklore" dynamic programming algorithm that has been rediscovered too many times to trace the original discoverer, even though it's often called the WagnerFisherAlgorithm. Further discussion over at http://wiki.tcl.tk/3108. Applications for diff algorithms include comparison of amino acid sequences (and the answer to the question of 'how far apart are they?'), Wikis, storage reduction in a version control system, historical analysis of different copies of old texts (say the Magna Carta), and by the "patch" command to update source files without needing to replace the whole file. From one example of typical diff implementation source used by the cvs program at: http://cvs.sourceforge.net/viewcvs.py/*checkout*/cvsgui/cvsgui/cvs-1.10/diff/analyze.c?&rev=1.1.1.3 The basic algorithm is described in: "An O(ND) Difference Algorithm and its Variations", Eugene Myers, Algorithmica Vol. 1 No. 2, 1986, pp. 251-266; see especially section 4.2, which describes the variation used below. Unless the --minimal option is specified, this code uses the TOO_EXPENSIVE heuristic, by Paul Eggert, to limit the cost to O(N**1.5 log N) at the price of producing suboptimal output for large inputs with many differences. The basic algorithm was independently discovered as described in: "Algorithms for Approximate String Matching", E. Ukkonen, Information and Control Vol. 64, 1985, pp. 100-118. ---- Note: This Wiki just calls the unix diff command, which implements such an algorithm. For a discussion of how this wiki uses one implementation of this algorithm, the unix diff utility, see QuickDiff ---- The core of diff algorithms seeks to compare two sequences and to discover how the first can be transformed into the second by a sequence of operations using the primitives '''delete-subsequence''', and '''insert-subseqence'''. If a delete and an insert coincide on the same range then it can be labeled as a '''change-subsequence'''. The user doesn't always want to be informed as to which subsequences remain identical, but this information is nonetheless typically computed as a by-product of the basic algorithm. The sophisticated way to find those editing operations is to sort the two sequences in order of the values of the elements of the sequences; for when the basic unit is a text line, usually a hashcode of each line is used to represent the line, and the sequences are sorted by hashcode in order to discover groups of identical elements/lines. Once identical elements are found, they are clumped together into runs which were adjacent in the pre-sorted sequences. One common variant looks for the longest possible such runs, which is an O(N^2) approach, but for N equal to the number of possibilities in each group, '''not''' the number of total elements/lines, so this is usually acceptably fast. At this stage all matching regions between the two sequences/files have been found. Running sequentially through the elements/lines that have not been matched now trivially indicates which regions have been inserted, deleted, or changed. Usually moved regions are indicated as an insert and a delete, but with some extra effort (not typically done by most diff programs) they can be reported as '''moved'''. Actually source code in C can of course be found in e.g. the GNU version of diff or the one from Version 7 Unix, also available online. The smallest versions of this algorithm can be implemented in less space than the above text of this article; it's less complicated than it may sound. -- DougMerritt ---- One aspect of 'diff' is that many times the users of the diff output, is not actually interested in the differences between the two strings/files/text/binaries. What they are more interested in is how similar the two things are. A example of this is attempts to find plagiarism in written text, or code that was re-used in another program. At this time 'diff' is typically used for this, it probably isn't the right type of algorithm to use. This especially becomes apparent when the two strings/files/text/binaries are typically mostly different, but share a small common, or 'nearly' common part or parts. The individual parts are usually in block or groups, with only small segments (such as 'identifiers', or people names) changed. To add to the the larger blocks of similar text, may not be in the same order, but shuffled in sequence! And it is when that happens that the normal 'diff' algorithm breaks down. 'diff' is designed to compare things that are mostly the same, with small amounts of 'modifications'. [i]What can you use to deal with strings that has been 'shuffled', or which are 'mostly different'?[/i] ---- I'd love to see someone implement diff in a modern language like Haskell. http://hackage.haskell.org/package/Diff ''The book that I learned Haskell from uses diff as an example of DynamicProgramming.'' * Which book was that??? Discussing the application of DesignPattern''''''s, AntiPattern''''''s, etc to diff algorithms might be interesting and a valuable resource. -- DougMerritt ---- My favorite mode of the diff program I'm currently uses puts these 2 characters in front of each line in the output: * only in new file -- inserted * <- from old file here -- moved * -> to new file here -- moved (a line starting with "<-" can always be paired with an identical line that starts with "->"). If I don't see any (non-blank) lines that start " 2 (Dots to prevent TabMunging) You may have to adjust it a bit for different dialects, such as adding explicit aliases. It does not give you sequence info and does not distinguish between duplicate lines in the same text. But, it tells you the text of differences. Reader exercise: Use nested queries and SUM(...) to also factor in line quantity occurrences. --top ---- See also: * AlgorithmsRoadMap ---- CategoryAlgorithm