Dotplot Interpretation
Interpreting Dotplot Patterns
Jonathan Helfman, 1994


Squares formed by a synthetic sequence of a's and b's. Squares indicate a change of vocabulary.


Diagonal features indicate copies, versions, and translations.


A light cross indicates a sub-sequence of unusual vocabulary.


Broken diagonals indicate exactly where a change has been made to a copy.


A checkerboard formed by reordered squares.


Reordered diagonals.


Density variation.


Shuffled squares and diagonals.

Dotplot Visual Language
Like the vocabulary of a language, the basic dotplot features keep their meanings across many variations and combinations. Surprisingly, they also keep their meanings at different scales. Dotplots can be interpreted by recognizing the basic variations of visual features, understanding their meanings, and relating those meanings to your data.

The dotplots on this page are created from synthetic character sequences. Here characters are treated as tokens and a dot is plotted where characters match.
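As a minimal sketch of this construction, assuming plain Python and two hypothetical helper names (dotplot and render, which are illustrative and not taken from any dotplot tool), a sequence can be compared against itself and a dot drawn wherever two characters match:

```python
def dotplot(seq_a, seq_b=None):
    """Return a grid of booleans: cell [i][j] is True when seq_a[i] == seq_b[j].

    Hypothetical helper for illustration. With one argument, the sequence
    is compared against itself.
    """
    if seq_b is None:
        seq_b = seq_a
    return [[a == b for b in seq_b] for a in seq_a]


def render(grid):
    """Draw a grid as text, using '*' for a dot and '.' for no match."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in grid)


# Each character is one token; a dot marks every pair of equal characters.
print(render(dotplot("abab")))
```

Because every token matches itself, a self-comparison like this one always has an unbroken main diagonal.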

Squares & Diagonals
Squares and diagonals are the basic dotplot features.

Squares are formed by a sorted sequence of a's and b's. The a's match each other but not the b's, and the b's match each other but not the a's. In general, one square indicates a high density of unordered matches, usually due to common vocabulary, while multiple squares indicate a change in vocabulary.

Diagonals are formed by a repeated sequence. In general, diagonals indicate ordered matches such as copies or versions. Diagonals indicate that two sub-sequences have a significant number of words in common, but unlike squares, the common words occur in the same order.

In general, squares and diagonals represent unordered and ordered matches, respectively. The entire spectrum of dotplot similarity patterns may be understood in terms of variations and combinations of squares and diagonals.
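The two basic features can be reproduced with very short synthetic sequences. The following is only a sketch: the helper names and the sequences "aaabbb" and "abcabc" are assumptions chosen for illustration, with a sorted sequence standing in for unordered matches and a repeated sequence standing in for a copy.

```python
def dotplot(seq):
    """Compare a sequence against itself; True marks a matching pair of tokens."""
    return [[x == y for y in seq] for x in seq]


def render(grid):
    """Draw a grid as text, using '*' for a dot and '.' for no match."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in grid)


# A sorted sequence: the a's match only a's and the b's match only b's,
# so the plot shows two filled squares along the main diagonal.
squares = render(dotplot("aaabbb"))

# A repeated sequence: each token matches its copy three positions away,
# so the plot shows short diagonals parallel to the main diagonal.
diagonals = render(dotplot("abcabc"))

print(squares)
print()
print(diagonals)
```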

Insertion
The simplest variation of the basic features involves insertion of non-matching tokens into sequences that would otherwise match. Broken squares indicate an insertion of tokens into a sequence that would otherwise form a square. A broken square may also appear as a light cross. In general, a light cross indicates a unique sub-sequence inside a sequence with a high density of unordered matches. In a large document collection, a light cross might indicate an off-topic document or one in a unique language. In a large software system, a light cross might indicate an independent module with unique symbols.
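One way to see why the inserted tokens show up as a light cross is to count the dots in each row of the plot. This is a sketch under assumptions: row_densities is a hypothetical helper, and the sequence "aaaaxyaaaa" is an invented example in which x and y are the inserted tokens.

```python
def row_densities(seq):
    """For each token, count how many tokens of seq it matches (dots in its row)."""
    return [sum(1 for other in seq if other == token) for token in seq]


# Two unique tokens, x and y, inserted into a run of a's.
counts = row_densities("aaaaxyaaaa")

# Rows that match nothing but themselves form the arms of the light cross.
# A self-comparison is symmetric, so the same positions are light as columns.
light_rows = [i for i, count in enumerate(counts) if count == 1]

print(counts)
print(light_rows)
```

The dense rows belong to the square, and the nearly empty rows and columns at the insertion point cut across it.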

Broken diagonals indicate the insertion of tokens into what would otherwise be perfect copies. The breaks indicate exactly where the changes have been made. When comparing two versions of a document or a source-code file, the breaks in the diagonals indicate exactly where new material has been added.
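The break in a diagonal can also be located numerically. The sketch below assumes two invented versions of a sequence with no repeated tokens, and two hypothetical helpers: diagonal_offsets records how far each match sits from the main diagonal, and breaks reports where that distance changes.

```python
def diagonal_offsets(original, revised):
    """For each position i in original, list the offsets j - i of its matches in revised."""
    return [
        [j - i for j, new in enumerate(revised) if new == old]
        for i, old in enumerate(original)
    ]


def breaks(offsets):
    """Return the positions in original where the diagonal offset changes."""
    return [i for i in range(1, len(offsets)) if offsets[i] != offsets[i - 1]]


# The revised version has two new tokens, X and Y, inserted after "abcd".
offsets = diagonal_offsets("abcdefgh", "abcdXYefgh")
break_points = breaks(offsets)

print(offsets)
print(break_points)
```

Here the diagonal runs at offset 0 for the first four tokens and at offset 2 afterwards, so the break at position 4 marks the insertion point and the jump of 2 gives the length of the inserted material. Real text has repeated tokens, so each row would usually hold several offsets and the break would need to be read from the dominant diagonal.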

Reordering
Squares and diagonals may be obscured by reordering. The same sequences that form squares and diagonals may be reordered to form checkerboards and shattered diagonals.
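A small sketch can show both effects. The sequences here are invented for illustration, and dotplot and render are hypothetical helpers: "aabbaabb" contains the same tokens as the sorted "aaaabbbb" in a different order, and "defabc" is "abcdef" with its two halves swapped.

```python
def dotplot(seq_a, seq_b=None):
    """Return a grid where cell [i][j] is True when seq_a[i] == seq_b[j]."""
    if seq_b is None:
        seq_b = seq_a
    return [[a == b for b in seq_b] for a in seq_a]


def render(grid):
    """Draw a grid as text, using '*' for a dot and '.' for no match."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in grid)


# Same tokens as "aaaabbbb", reordered in blocks: the two squares
# break up into a checkerboard of smaller squares.
checkerboard = render(dotplot("aabbaabb"))

# A sequence compared with a copy whose halves are swapped: the single
# diagonal shatters into two shorter diagonals at different offsets.
shattered = render(dotplot("abcdef", "defabc"))

print(checkerboard)
print()
print(shattered)
```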

When reordering patterns appear in a comparison of multiple versions of documents or code, they often indicate that filenames have changed between versions, which changes the relative order in which the files are compared.

Shuffling
Squares may appear in different densities. When non-repeating tokens are shuffled into the second half of a sequence that creates two squares, the second square looks lighter because fewer tokens are matching.
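The change in density can be checked by counting. In this sketch, block_density is a hypothetical helper and the sequences are invented: "aaaabbbb" forms two full squares, and the second version shuffles the non-repeating tokens w, x, y, and z into the run of b's.

```python
def block_density(seq, start, stop):
    """Fraction of matching cells in the square block seq[start:stop] x seq[start:stop]."""
    block = seq[start:stop]
    matches = sum(1 for x in block for y in block if x == y)
    return matches / (len(block) ** 2)


plain = "aaaa" + "bbbb"
shuffled = "aaaa" + "bwbxbybz"  # unique tokens shuffled into the second half

first_square = block_density(shuffled, 0, 4)
second_square_plain = block_density(plain, 4, 8)
second_square_shuffled = block_density(shuffled, 4, 12)

print(first_square, second_square_plain, second_square_shuffled)
```

In the shuffled version the b's still match one another, but each unique token matches only itself, so 20 of the 64 cells in the second block hold dots and the square appears lighter.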

Any two sequences may be shuffled together. A pattern of shuffled squares and diagonals identifies documents and their translations.


Copyright © Jonathan Helfman, ImageBeat Design, 2000-2004.