Dotplot Overview
Overview of the Dotplot visualization technique
Jonathan Helfman, 1994
Figure: The dotplot technique tokenizes a sequence and plots it against itself, with a dot where the tokens match.
Figure: Dotplot of two genetic sequences.
Figure: "The Complete Works of Shakespeare", about one million words.
Figure: Canadian Parliamentary debates stored in French and English, about 37 million words.
Figure: A pattern of shrinking diagonals formed by data-structure initializations.
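A sequence is tokenized and plotted against itself, from left to right
and top to bottom, with a dot wherever two tokens match.
Every token matches itself, so the main diagonal is always filled. Dots
off the main diagonal indicate matches between different positions in
the sequence.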
Any sort of tokenization is possible. For natural language text, tokens are typically delimited by spaces, tabs, and punctuation. For software, lines of code are useful tokens.
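The basic technique described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the original implementation: the function names `tokenize`, `dotplot`, and `render` are hypothetical, and lowercasing the tokens is an assumption made here so that "To" and "to" match.

```python
import re

def tokenize(text):
    """Split natural-language text into word tokens.

    Spaces, tabs, and punctuation act as delimiters. Lowercasing is an
    assumption of this sketch, not something the technique requires.
    """
    return [t.lower() for t in re.findall(r"\w+", text)]

def dotplot(tokens):
    """Return the set of (row, col) pairs where tokens[row] == tokens[col]."""
    positions = {}
    for i, tok in enumerate(tokens):
        positions.setdefault(tok, []).append(i)
    dots = set()
    # Every pair of positions holding the same token produces a dot.
    for idxs in positions.values():
        for i in idxs:
            for j in idxs:
                dots.add((i, j))
    return dots

def render(tokens, dots):
    """Draw the plot as text: '*' for a match, '.' otherwise."""
    n = len(tokens)
    return "\n".join(
        "".join("*" if (row, col) in dots else "." for col in range(n))
        for row in range(n)
    )

tokens = tokenize("to be or not to be")
print(render(tokens, dotplot(tokens)))
```

In this small example the repeated phrase "to be" appears as a short diagonal off the main diagonal, at positions (0, 4) and (1, 5), mirrored at (4, 0) and (5, 1). To plot software instead, one could swap `tokenize` for `text.splitlines()` so that each line of code is a token.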
Biology
Dotplots were first used in biology to study homology
(self-similarity) in genetic sequences. The diagonals in the dotplot
of two genetic sequences indicate an almost-perfect match. Unlike the
early dotplots used in biology, our generalized technique allows
arbitrary weighting functions and the plotting of much larger amounts
of data.
Grand Scale
Dotplots provide an overview of millions of data points, more than
any other visualization technique known at the time of writing (1994).
To identify matches in millions of tokens, the technique is extended
by adding weighting, reconstruction, and approximation methods.
A typical weighting function uses the inverse of a token's
frequency, so that matches between frequent tokens do not saturate
the plot. Image reconstruction methods accumulate the matches from
multiple tokens into a single pixel. As an approximation, tokens with
weights smaller than a given threshold are not plotted, which allows
plots to be created at nearly interactive rates.
Optional grid lines show the boundaries between input files.
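The weighting, accumulation, and thresholding steps described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the original implementation: the function name `weighted_dotplot` and its parameters `size` and `min_weight` are hypothetical, and simply summing the weights that fall into each pixel stands in for whatever image reconstruction method is actually used.

```python
from collections import Counter, defaultdict

def weighted_dotplot(tokens, size, min_weight=0.0):
    """Accumulate inverse-frequency-weighted matches into a size x size grid.

    Each match between two occurrences of a token adds 1 / frequency(token)
    to the pixel covering that pair of positions. Tokens whose weight is
    below min_weight are skipped entirely (the approximation step).
    """
    grid = [[0.0] * size for _ in range(size)]
    n = len(tokens)
    if n == 0:
        return grid

    freq = Counter(tokens)
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        positions[tok].append(i)

    for tok, idxs in positions.items():
        weight = 1.0 / freq[tok]
        if weight < min_weight:
            continue  # frequent token: its many matches are not plotted
        for i in idxs:
            row = i * size // n  # map token position to pixel row
            for j in idxs:
                col = j * size // n  # map token position to pixel column
                grid[row][col] += weight
    return grid

# Usage: six tokens squeezed into a 3 x 3 image, two tokens per pixel.
tokens = ["the", "cat", "the", "dog", "the", "cat"]
for row in weighted_dotplot(tokens, size=3, min_weight=0.4):
    print(row)
```

A token that occurs k times contributes k × k matches, so skipping the most frequent tokens removes the bulk of the work while discarding only the lowest-weight dots. In the usage example, "the" occurs three times, has weight 1/3, and falls below the 0.4 threshold, so only "cat" and "dog" contribute.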
Similarity Structures
Dotplots let people use their visual pattern-recognition skills to
identify similarities, an approach that is typically less sensitive to
noise than traditional algorithmic approaches, such as
longest common substrings, suffix trees, or the dynamic programming
techniques used in the UNIX diff utility.
The texture of shrinking diagonals in the plot of data-structure
initializations is an example that would be difficult to appreciate
with a text editor or with any existing algorithmic approach to
detecting similarity. The texture is caused by a repeated set of 16
data-structure initializations. Each time the initializations are
repeated, one of the 16 assigned values is different.
Dotplots reveal otherwise hidden similarity structures in our text, code, and data. These structures are used to identify copies, versions, translations, documents with similar vocabulary, and software modules with similar comments and symbols.