Dotplot Applications
Jonathan Helfman, 1994

Figure: A manual chapter in Dutch, French, German, Italian, Spanish, and Swedish. One million 4-grams.
Figure: A multi-language text editor with an interactive alignment plot.
Figure: Two versions of dix (8000 lines of C code).
Figure: Two versions of xmh (20,000 lines of C code).
Figure: Determining file pairs for re-translation.

Alignment
A combination of squares and diagonals identifies translations. The
dark squares on the main diagonal are formed by tokens matching within
the same language. The diagonal texture is formed by names and numbers
that are the same in each language. The diagonals identify alignments
between translations. An alignment function may be created by fitting
the points along the diagonals. The alignment function matches a
position in one document to the corresponding position in the
translation. Alignments are used to construct multi-lingual
concordances for terminology research.
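
As a rough illustration of fitting an alignment function, the sketch below collects dots from tokens that occur exactly once in both documents (standing in for the shared names and numbers), keeps an increasing chain of them, and interpolates linearly between neighbors. The function names `anchor_points`, `increasing_chain`, and `make_alignment` are hypothetical, and the greedy chain and piecewise-linear fit are assumptions chosen for brevity, not the fitting method used for the plots described here.

```python
from bisect import bisect_right
from collections import Counter


def anchor_points(tokens_a, tokens_b):
    """Return (i, j) dots for tokens that occur exactly once in each sequence."""
    count_a, count_b = Counter(tokens_a), Counter(tokens_b)
    pos_b = {tok: j for j, tok in enumerate(tokens_b)}
    return [(i, pos_b[tok]) for i, tok in enumerate(tokens_a)
            if count_a[tok] == 1 and count_b.get(tok) == 1]


def increasing_chain(points):
    """Greedily drop dots that would make the alignment run backwards.

    Assumption: a greedy pass is good enough for a sketch; a single early
    outlier can hide later dots, so a real fit would need something sturdier.
    """
    chain = []
    for i, j in points:
        if not chain or j > chain[-1][1]:
            chain.append((i, j))
    return chain


def make_alignment(points):
    """Build a piecewise-linear function from positions in A to positions in B."""
    if not points:
        raise ValueError("need at least one anchor point")
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]

    def align(i):
        # Clamp positions that fall outside the anchored range.
        if i <= xs[0]:
            return float(ys[0])
        if i >= xs[-1]:
            return float(ys[-1])
        k = bisect_right(xs, i) - 1          # xs[k] <= i < xs[k + 1]
        t = (i - xs[k]) / (xs[k + 1] - xs[k])
        return ys[k] + t * (ys[k + 1] - ys[k])

    return align


english = "the report 1994 lists AT&T and Emacs in section 7".split()
french = "le rapport 1994 cite AT&T et Emacs dans la section 7".split()
points = increasing_chain(anchor_points(english, french))
align = make_alignment(points)
print(align(7))  # token 7 of the English text maps between "Emacs" and "section"
```

Subtracting the fitted position from an observed dot position gives a residual, which is one way to think about the residual plot mentioned in the editor discussion.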
Alignments are also useful for multi-language text editors. We have extended the Emacs text editor to maintain correspondences and identify anomalies between translations. Scrolling or highlighting in one document will scroll or highlight the corresponding region of the other document. Selecting positions on the residual plot that have a large slope will often identify discrepancies in the translations.

Version Identification
Large systems or documents may span several files. Two versions of a
multi-file system will only appear as diagonals if the files of each
version are in the same relative order. Here the two versions of xmh
are in the same relative order, but the two versions of dix are
not. When reordered diagonals appear in dotplots of software versions,
they usually indicate that file names have changed between versions.
In some cases it is useful to automatically reorder sequences.

Re-translation is a service offered by AT&T Business Translation to simplify translating a new version of a previously translated document. Only the differences between the versions need to be translated. The file names and boundaries are typically different in the new version, however, so the file pairs must be determined before the differences can be identified. Diagonals can be reconstructed by an image processing algorithm that identifies the clearest diagonal on a strip of grid boxes. Automatic diagonal reconstruction identifies file pairs in reordered versions, which is a crucial first step in comparing versions of large systems or documents that have diverged over time.
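
To make the file-pairing idea concrete, here is a minimal sketch that works on lines of text instead of a rendered image. For each pair of files it counts the matching lines on every diagonal (constant offset `j - i`) and treats the fullest diagonal as the clearest one; each new file is then paired with the old file whose clearest diagonal holds the most dots. The names `clearest_diagonal` and `pair_files` are hypothetical, and this line-counting approach is a stand-in assumption, not the image processing algorithm described above.

```python
from collections import Counter, defaultdict


def clearest_diagonal(lines_a, lines_b):
    """Return (offset, dots) for the diagonal j - i = offset with the most matches.

    Blank lines are skipped because they would match everywhere.
    """
    positions = defaultdict(list)
    for j, line in enumerate(lines_b):
        if line.strip():
            positions[line].append(j)

    diagonals = Counter()
    for i, line in enumerate(lines_a):
        for j in positions.get(line, ()):
            diagonals[j - i] += 1

    if not diagonals:
        return 0, 0
    return max(diagonals.items(), key=lambda item: item[1])


def pair_files(old_version, new_version):
    """Map each new file name to the old file name with the clearest diagonal.

    Both arguments are dicts of file name -> list of lines. A new file with
    no matching lines in any old file is mapped to None.
    """
    pairs = {}
    for new_name, new_lines in new_version.items():
        best_name, best_dots = None, 0
        for old_name, old_lines in old_version.items():
            _, dots = clearest_diagonal(old_lines, new_lines)
            if dots > best_dots:
                best_name, best_dots = old_name, dots
        pairs[new_name] = best_name
    return pairs


old = {"intro.txt": ["a", "b", "c", "d"], "usage.txt": ["w", "x", "y", "z"]}
new = {"ch2.txt": ["new", "w", "x", "y", "z"], "ch1.txt": ["a", "b", "changed", "d"]}
print(pair_files(old, new))
```

Because this sketch compares whole files, it assumes file boundaries are roughly preserved between versions; when boundaries also shift, the same diagonal count could be applied to fixed-size strips of lines instead.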