Dotplot Visualization Technique

Six Words of Shakespeare

 
A sequence is tokenized and plotted from left to right and top to bottom, with a dot where the tokens match.

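Every token matches itself, so the main diagonal is always filled. Dots off the main diagonal indicate matches between different positions in the sequence, that is, repeated tokens.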
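The basic construction can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function names `dotplot` and `render`, the whitespace tokenizer, and the six-word example phrase are all assumptions chosen for the sketch.

```python
def dotplot(tokens):
    """Return an n x n matrix that is True where tokens[i] == tokens[j]."""
    n = len(tokens)
    return [[tokens[i] == tokens[j] for j in range(n)] for i in range(n)]


def render(matrix):
    """Draw the matrix as text: '#' for a match, '.' for no match."""
    return "\n".join(
        "".join("#" if cell else "." for cell in row) for row in matrix
    )


# Hypothetical input: a six-word phrase split on whitespace.
words = "to be or not to be".split()
plot = dotplot(words)
print(render(plot))
# #...#.
# .#...#
# ..#...
# ...#..
# #...#.
# .#...#
```

The two short off-diagonal runs in the corners are the repeated "to be"; a repeated phrase always shows up as a diagonal parallel to the main one.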

To identify matches in millions of tokens, the technique is extended by adding weighting, reconstruction, and approximation methods.


A Million Words of Shakespeare
Weighting prevents matches between frequent tokens from saturating the plot; a typical weighting function uses the inverse of a token's frequency. Reconstruction methods facilitate scaling by accumulating matches from multiple tokens in a single pixel. One approximation, which allows plots to be created at nearly interactive rates, is to skip tokens with small weights. Optional grid lines show the boundaries between input files.
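The three extensions can be combined in one short sketch. The inverse-frequency weight comes from the description above; everything else is an assumption made for illustration: the function name `weighted_dotplot`, the `size` and `min_weight` parameters, and the simple proportional mapping from token positions to pixels. Real implementations for millions of tokens would need a more careful design.

```python
from collections import Counter


def weighted_dotplot(tokens, size, min_weight=0.0):
    """Accumulate weighted matches into a size x size grid of pixels.

    Assumes tokens is non-empty and size >= 1.
    """
    n = len(tokens)
    counts = Counter(tokens)

    # Group the positions of each distinct token so that only
    # matching pairs are ever visited.
    positions = {}
    for i, tok in enumerate(tokens):
        positions.setdefault(tok, []).append(i)

    grid = [[0.0] * size for _ in range(size)]
    for tok, pos in positions.items():
        weight = 1.0 / counts[tok]      # weighting: inverse frequency
        if weight < min_weight:
            continue                    # approximation: skip small weights
        for i in pos:
            row = i * size // n         # reconstruction: many tokens per pixel
            for j in pos:
                col = j * size // n
                grid[row][col] += weight
    return grid


tokens = "the cat and the dog and the bird".split()
full = weighted_dotplot(tokens, size=4)
approx = weighted_dotplot(tokens, size=4, min_weight=0.4)
```

In this example each pixel covers two tokens. With `min_weight=0.4` the most frequent token, "the" (weight 1/3), is dropped entirely, which removes nine of the matching pairs from the work to be done while leaving the rarer, more informative matches in place.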

Three Years of Canadian Parliamentary Debates in English and French
37 Million Words

Grand Scale
Dotplots provide an overview of millions of data points, more than any other known visualization technique. In the plot of the parliamentary debates, the combination of squares and diagonals indicates an alignment between a sequence and its translation.


Two DNA Sequences
7000 Nucleotides

Biology
Dotplots were first used in biology to study homology (similarity) in genetic sequences. The diagonals in this plot indicate an almost-perfect match between two DNA sequences. We have generalized the early dotplots used in biology to allow arbitrary weighting functions and the plotting of much larger amounts of data.
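A two-sequence dotplot of this kind can be sketched as follows. Because single nucleotides match by chance very often, the sketch compares short windows (k-mers) rather than single characters; that choice, the function name `kmer_dotplot`, the window length `k=3`, and the two short made-up sequences are assumptions for illustration, not details taken from the plot described above.

```python
def kmer_dotplot(a, b, k=3):
    """Return a matrix that is True where a[i:i+k] == b[j:j+k]."""
    rows = len(a) - k + 1
    cols = len(b) - k + 1
    return [[a[i:i + k] == b[j:j + k] for j in range(cols)] for i in range(rows)]


# Two hypothetical sequences that differ by one substitution at index 4.
seq_a = "ACGTTGCAAG"
seq_b = "ACGTAGCAAG"

plot_ab = kmer_dotplot(seq_a, seq_b)
diagonal = [plot_ab[i][i] for i in range(len(plot_ab))]
print(diagonal)
# [True, True, False, False, False, True, True, True]
```

The single substitution breaks every window that overlaps it, leaving a gap in an otherwise continuous diagonal. An almost-perfect match between two sequences therefore appears as a long diagonal with occasional short breaks.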


Texture of Repeated Data Initializations in 300 Lines of C Code

Similarity Structures

There are algorithmic approaches to identifying string matches in textual sequences, such as longest common substrings, suffix trees, or the dynamic programming techniques used in the UNIX diff utility.
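For comparison, one of these algorithmic approaches can be tried with Python's standard `difflib` module, whose `SequenceMatcher.find_longest_match` returns the longest contiguous block shared by two sequences. This is only an illustration of the general idea: it is not the algorithm used by UNIX diff, and the two token lists are made up for the example.

```python
from difflib import SequenceMatcher

# Hypothetical token sequences to compare.
a = "to be or not to be".split()
b = "whether to be or not".split()

# autojunk=False turns off difflib's heuristic that ignores very common items.
matcher = SequenceMatcher(None, a, b, autojunk=False)
match = matcher.find_longest_match(0, len(a), 0, len(b))

# match.a and match.b are the start positions in a and b; match.size is the length.
print(a[match.a:match.a + match.size])
# ['to', 'be', 'or', 'not']
```

An algorithm of this kind reports a single best block. The same match in a dotplot would be one diagonal among all the others, so shorter or interrupted repeats remain visible alongside it.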

Dotplots let people use their visual pattern-recognition skills to identify similarities, an approach that is typically less sensitive to noise than traditional algorithmic approaches.

The texture of shrinking diagonals in the plot above is an example that would be difficult to appreciate with a text editor or with any existing algorithmic approach to detecting similarity. The texture is caused by a repeated set of 16 data structure initializations. Each time the initializations are repeated, one of the 16 assigned values is different.

Dotplots reveal otherwise hidden structures in our text, code, and data. These similarity structures are used to identify copies, versions, translations, documents about similar subjects, and software modules with similar comments and symbols.


Copyright © 2000-2004 Jonathan Helfman