Dotplot Overview
Overview of the Dotplot visualization technique
Jonathan Helfman, 1994



The dotplot technique tokenizes a sequence and plots it against itself, placing a dot wherever two tokens match.


Dotplot of two genetic sequences.


"The Complete Works of Shakespeare," about one million words.


Canadian Parliamentary debates stored in French and English, about 37 million words.


A pattern of shrinking diagonals formed by data-structure initializations.

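A sequence is tokenized and plotted against itself, from left to right and top to bottom, with a dot wherever two tokens match. Every token matches itself, so the main diagonal is always filled. Dots off the main diagonal indicate tokens that recur elsewhere in the sequence.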

Any sort of tokenization is possible. For natural language text, tokens are typically delimited by spaces, tabs, and punctuation. For software, lines of code are useful tokens.
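The basic technique can be sketched in a few lines of Python. This is only an illustration, not the original implementation: the function names (`tokenize_words`, `tokenize_lines`, `dotplot`, `render`) are hypothetical, and lowercasing words and skipping blank lines are assumptions made for the example.

```python
import re

def tokenize_words(text):
    """Split text on whitespace and punctuation; lowercasing is an assumption."""
    return [t.lower() for t in re.split(r"\W+", text) if t]

def tokenize_lines(source):
    """Treat each non-blank line of code as one token, ignoring indentation."""
    return [line.strip() for line in source.splitlines() if line.strip()]

def dotplot(tokens):
    """Return a square grid where grid[i][j] is True when tokens i and j match."""
    return [[a == b for b in tokens] for a in tokens]

def render(grid):
    """Draw the grid as text: '*' for a match, '.' otherwise."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in grid)

tokens = tokenize_words("to be or not to be")
plot = dotplot(tokens)
print(render(plot))
# *...*.
# .*...*
# ..*...
# ...*..
# *...*.
# .*...*
```

The repeated phrase "to be" shows up as two short diagonals parallel to the main diagonal, which is the kind of structure a dotplot is meant to expose.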

Biology
Dotplots were first used in biology to study homology (self-similarity) in genetic sequences. The diagonals in the dotplot of two genetic sequences shown above indicate an almost-perfect match. Unlike the early dotplots used in biology, we have generalized the technique to allow arbitrary weighting functions and the plotting of much larger amounts of data.

Grand Scale
Dotplots provide an overview of millions of data points, more than any other known visualization technique. To identify matches in millions of tokens, the technique is extended by adding weighting, reconstruction, and approximation methods. A typical weighting function uses the inverse of a token's frequency, which prevents matches between frequent tokens from saturating the plot. Image reconstruction methods accumulate matches from multiple tokens into a single pixel. Tokens with weights smaller than a given threshold are not plotted, an approximation that allows plots to be created at nearly interactive rates. Optional grid lines show the boundaries between input files.
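A minimal sketch of how weighting, accumulation, and thresholding might fit together is given below. The text does not give exact formulas, so several choices here are assumptions: the weight is taken as 1 / (number of occurrences), matches are accumulated by simple summation into a coarser pixel grid, and the names `weighted_dotplot` and `min_weight` are hypothetical.

```python
from collections import Counter

def inverse_frequency_weights(tokens):
    """Weight each token by the inverse of its occurrence count (an assumption)."""
    counts = Counter(tokens)
    return {tok: 1.0 / n for tok, n in counts.items()}

def weighted_dotplot(tokens, pixels, min_weight=0.0):
    """Accumulate weighted matches into a pixels x pixels image."""
    n = len(tokens)
    weights = inverse_frequency_weights(tokens)
    image = [[0.0] * pixels for _ in range(pixels)]
    # Group positions by token so that only matching pairs are visited.
    positions = {}
    for i, tok in enumerate(tokens):
        positions.setdefault(tok, []).append(i)
    for tok, idxs in positions.items():
        w = weights[tok]
        if w < min_weight:
            continue  # approximation: frequent, low-weight tokens are not plotted
        for i in idxs:
            for j in idxs:
                # Several token pairs can land in the same pixel; sum their weights.
                image[i * pixels // n][j * pixels // n] += w
    return image

tokens = "the cat sat on the mat the end".split()
full = weighted_dotplot(tokens, pixels=4)
thresholded = weighted_dotplot(tokens, pixels=4, min_weight=0.5)
```

Here the frequent token "the" contributes faint off-diagonal pixels in `full` and is dropped entirely in `thresholded`. Visiting only the positions of each token, rather than all pairs, is one reason skipping frequent tokens saves so much work: a token that occurs k times generates k² matches.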

Similarity Structures
Dotplots let people use their visual pattern-recognition skills to identify similarities, an approach that is typically less sensitive to noise than traditional algorithmic approaches, such as longest common substrings, suffix trees, or the dynamic programming techniques used in the UNIX diff utility. The texture of shrinking diagonals in the plot shown above is an example that would be difficult to appreciate with a text editor or with any existing algorithmic approach to detecting similarity. The texture is caused by a repeated set of 16 data-structure initializations. Each time the initializations are repeated, one of the 16 assigned values is different.
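To see why such input produces shrinking diagonals, consider a synthetic stand-in for the code described. This is one possible reading, not the actual source that was plotted: the `table[k] = v;` lines are invented, and the changes are assumed to be cumulative, so that repetition r differs from the first block in its first r values.

```python
def make_blocks(count=16):
    """Build `count` repetitions of `count` initialization lines (hypothetical data).

    In repetition r, the first r values are 1 and the rest are 0, so each
    repetition differs from the previous one in exactly one assigned value.
    """
    lines = []
    for r in range(count):
        for k in range(count):
            value = 1 if k < r else 0
            lines.append(f"table[{k}] = {value};")
    return lines

def aligned_matches(lines, block_a, block_b, size=16):
    """Count matching lines along the diagonal where two blocks are aligned."""
    a = lines[block_a * size:(block_a + 1) * size]
    b = lines[block_b * size:(block_b + 1) * size]
    return sum(x == y for x, y in zip(a, b))

lines = make_blocks()
# Number of dots on the diagonal comparing the first block with each repetition.
lengths = [aligned_matches(lines, 0, r) for r in range(16)]
print(lengths)
# [16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
```

Under this assumption each successive diagonal is one dot shorter than the last, a gradual pattern that is easy to see in a plot but awkward to summarize as a list of line-by-line differences.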

Dotplots reveal otherwise hidden similarity structures in our text, code, and data. These structures are used to identify copies, versions, translations, documents with similar vocabulary, and software modules with similar comments and symbols.


Copyright © Jonathan Helfman, ImageBeat Design, 2000-2004.