Similarity Patterns in Language

Jonathan Isaac Helfman
AT&T Bell Laboratories
Murray Hill, NJ 07974-0636
jon@research.att.com

ABSTRACT

Dotplot is a technique for visualizing patterns of string matches in millions of lines of text and code. Patterns may be explored interactively or detected automatically. Applications include text analysis (author identification, plagiarism detection, translation alignment, etc.), software engineering (module and version identification, subroutine categorization, redundant code identification, etc.), and information retrieval (identification of similar records in results of queries). Patterns are interpreted through a visual language. Squares identify unordered matches (documents with lots of matching words or subroutines with lots of matching symbols), while diagonals identify ordered matches (copies, versions, and translations). Patterns of squares and diagonals have more complex interpretations that identify subtler relationships.

1. Introduction

Fig. 1 a) Six words, b) A million words of Shakespeare

The dotplot technique is illustrated in Fig. 1a. A sequence is tokenized and plotted from left to right and top to bottom, with a dot wherever two tokens match. Dots off the main diagonal indicate similarities. While Fig. 1a shows six words of Shakespeare ("to be or not to be"), Fig. 1b shows "The Complete Works" [6]. Grid lines show the boundaries between the concatenated files. Dark areas show a high density of matches. Unlike Fig. 1a, Fig. 1b uses weighting and reconstruction methods to display matches from more than one pair of tokens in a single pixel. Weighting prevents matches between frequent tokens from saturating the plot. Additional details of the dotplot technique and associated browser are described elsewhere [2].

Small dark squares along the main diagonal in Fig. 1b are caused by names of characters, which generally match within a single work, but not across different works.
The exceptions are the European Histories, which share vocabulary and form a large dark cluster near the upper left. Squares are also formed by the character sequence of Fig. 3a, in which the a’s match each other, but not the b’s, and vice versa. In general, one square indicates a high density of unordered matches, usually due to common vocabulary, while two squares indicate a change in vocabulary.

Fig. 2 a) Two versions of xmh (20000 lines of C code), b) Repeated macros (5000 lines of manual pages)

Fig. 2a plots two versions of the xmh program. The software examples in this paper use C code from the X11R5 and X11R6 Window System [5]. Software is tokenized into lines so that a dot appears where two entire lines of code match. In Fig. 2a, diagonals are formed in the grid boxes that compare the different versions of each file. Diagonals are modeled by the character sequence of Fig. 3b. In general, diagonals indicate ordered matches such as copies or versions.

Fig. 3 a) Squares (the sequence of nine a’s followed by nine b’s), b) Diagonals (the sequence a through i, repeated twice)
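
The basic construction described above can be sketched in a few lines of Python. This is only a minimal, unweighted illustration, not the implementation described in [2]: it omits the weighting and reconstruction methods used for large plots. The names `dotplot` and `render`, and the two three-line "versions" of a file, are hypothetical and introduced here for illustration only.

```python
from collections import defaultdict


def dotplot(tokens):
    """Return the set of (row, col) positions where tokens[row] == tokens[col]."""
    # Group the positions of each distinct token, then emit every pair
    # of positions within a group (including the main diagonal).
    positions = defaultdict(list)
    for index, token in enumerate(tokens):
        positions[token].append(index)
    dots = set()
    for indices in positions.values():
        for i in indices:
            for j in indices:
                dots.add((i, j))
    return dots


def render(tokens, dots):
    """Draw the plot as text: '*' where tokens match, '.' elsewhere."""
    n = len(tokens)
    return "\n".join(
        "".join("*" if (i, j) in dots else "." for j in range(n))
        for i in range(n)
    )


# Fig. 1a: six words, tokenized on whitespace.
words = "to be or not to be".split()
word_dots = dotplot(words)
print(render(words, word_dots))

# Fig. 3: character sequences that model squares and diagonals.
squares = dotplot("a" * 9 + "b" * 9)
diagonals = dotplot("abcdefghi" * 2)

# Software is tokenized into lines; these two "versions" are made up.
version_1 = ["int x;", "x = 1;", "return x;"]
version_2 = ["int x;", "x = 2;", "return x;"]
code_dots = dotplot(version_1 + version_2)
```

For the six words of Fig. 1a this yields ten dots: six on the main diagonal and four off it, where the repeated "to" and "be" match. The Fig. 3a sequence fills two 9-by-9 squares and leaves the off-diagonal quadrants empty, while the Fig. 3b sequence adds one diagonal to each off-diagonal quadrant. In the two-version example, the changed middle line leaves a gap in the off-diagonal diagonals.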