Sketch of Research
Statistical analysis has long been employed in the study of literary texts and has tended, for obvious reasons, to concentrate on the most common or the most uncommon of the linguistic phenomena they display. The discovery by Burrows on which much of our work rests is that the frequency patterns of all the most common words, whatever they may be, are so distinctive, stable, and closely interlocked that they reveal more when they are examined together than when any of them is examined in isolation.
After ten years' work, we are able to form useful inferences, at an extremely high level of probability, from almost any comparative study of two or more sets of texts and, upon that basis, to predict how further texts will behave. Such comparisons may reflect change over time or differences of literary form within the writings of a single author; characteristic differences between two or more authors; and larger "class-differences" between different groups of authors. These last include consistent and intelligible differences between eighteenth-, nineteenth-, and twentieth-century authors; between male and female authors; and between Australian and non-Australian authors. In a related line of inquiry, we find that it is possible to detect systematic differences between those texts where one author revises the work of another and those where, starting with a clean sheet, one author imitates the work of another.
Our usual procedure is first to establish frequency tables for the thirty, fifty, or hundred most common words (whatever they may be) of a given set of texts; to extract a correlation matrix; and to subject the matrix to principal components analysis. If there are consistent resemblances and differences between some of the texts and others, the first two or three principal components of the correlation matrix allow the texts to form into clusters. In a second phase of the operation, the frequency tables are subjected to distribution tests (such as Student's t-test and the Mann-Whitney test) to determine just which words play the greatest part in separating the texts in the manner previously observed. The relationship between the clustering of the texts and the patterning of those words that show statistically significant differences between the clusters thus makes it possible to consider what linguistic factors are at work. In a third phase, those words that satisfied the distribution tests are used in a fresh analysis of the original set of texts and of texts not hitherto considered: if the previously formed clusters reflect a true difference between populations, the texts now added should take predictable positions.
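The first phase can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure's actual implementation: it assumes naive whitespace tokenisation, relative frequencies, a table with texts as rows and words as columns, and NumPy for the eigen-decomposition. The function names `frequency_table` and `pca_scores` are hypothetical, and the demonstration data are synthetic.

```python
from collections import Counter

import numpy as np


def frequency_table(texts, n_words):
    """Relative frequencies of the n_words most common words of the whole set.

    Assumption: lower-casing and whitespace splitting stand in for whatever
    text preparation a real study would use. Texts must be non-empty.
    """
    tokenised = [text.lower().split() for text in texts]
    totals = Counter(word for tokens in tokenised for word in tokens)
    vocab = [word for word, _ in totals.most_common(n_words)]
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        rows.append([counts[word] / len(tokens) for word in vocab])
    return vocab, np.array(rows)


def pca_scores(table, n_components=2):
    """Score each text on the leading principal components of the
    word-by-word correlation matrix."""
    # Standardise each word column so that the analysis rests on correlations.
    z = (table - table.mean(axis=0)) / table.std(axis=0, ddof=1)
    corr = np.corrcoef(table, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    scores = z @ eigvecs[:, order]
    explained = eigvals[order] / eigvals.sum()
    return scores, explained


# Synthetic stand-in for a real frequency table: 40 texts, 10 words, two groups.
rng = np.random.default_rng(0)
table = rng.normal(0.02, 0.002, size=(40, 10))
table[20:, :5] += 0.01  # the second group uses the first five words more
table[20:, 5:] -= 0.01  # and the last five words less
scores, explained = pca_scores(table)
print(scores[:, 0].round(2))  # first-component score of each text
```

Plotting the first column of `scores` against the second gives the kind of scatter in which clusters of texts, if there are any, become visible. The sign of each component is arbitrary, so only the relative positions of the texts matter.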
Figure 1 shows a good but not exceptional result. In the first of the phases sketched above, forty authorial sets of texts (each set consisting of at least three texts, all of them first-person retrospective narratives of a kind known in the eighteenth century as a "history") were compared. They formed two clusters in which the twenty eighteenth-century writers stood far from the twenty writers of our own century. The hundred most common words were then subjected to the two distribution tests in an attempt to establish which words most influenced this separation. Thirty-eight words satisfied both tests at the 0.001 level (p < 0.001), indicating, for each word, less than one chance in a thousand that so large a difference would arise if the two clusters came from the same population. The analysis was then repeated, with two changes: only the thirty-eight "strongly discriminating" words were used; and twelve fresh authorial sets were added. Six of these are authentic eighteenth- and twentieth-century work by Thomas Chatterton, on the one hand, and by David Lodge, John Barth, Angela Carter, J. M. Coetzee, and Erica Jong on the other. The remaining six sets, by the four novelists last named and by Peter Ackroyd and Georgette Heyer, all represent texts in which modern writers attempt to imitate the work of their eighteenth- or early-nineteenth-century forebears. As Figure 1 shows, the two main clusters are quite discrete and the test cases behave according to prediction: the authentic cases lie within or close beside the appropriate clusters; and, though they differ considerably from each other, none of the attempted imitations penetrates the target group. The value of specimens like this, where the truth is already known, is in testing procedures for use in cases where it is not.
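The word-selection step in this example can be sketched as below. It is an illustration under assumptions, not the code behind Figure 1: it assumes SciPy's `ttest_ind` (Student's t-test with equal variances) and `mannwhitneyu` (two-sided) as the two distribution tests, and frequency tables with texts as rows and words as columns. The function name `discriminating_words`, the word labels, and the data are hypothetical stand-ins.

```python
import numpy as np
from scipy.stats import mannwhitneyu, ttest_ind


def discriminating_words(group_a, group_b, vocab, alpha=0.001):
    """Words whose frequencies differ between two groups of texts on both tests.

    group_a and group_b are frequency tables (rows: texts, columns: words,
    in the order given by vocab). A word is kept only if both p-values fall
    below alpha.
    """
    kept = []
    for j, word in enumerate(vocab):
        a, b = group_a[:, j], group_b[:, j]
        p_t = ttest_ind(a, b).pvalue  # Student's t-test
        p_u = mannwhitneyu(a, b, alternative="two-sided").pvalue
        if p_t < alpha and p_u < alpha:
            kept.append(word)
    return kept


# Synthetic stand-in data: two groups of 20 texts and four placeholder words,
# of which only the first two are given a real difference between the groups.
rng = np.random.default_rng(1)
vocab = ["w0", "w1", "w2", "w3"]
older = rng.normal(10.0, 1.0, size=(20, 4))
modern = rng.normal(10.0, 1.0, size=(20, 4))
modern[:, :2] += 10.0
print(discriminating_words(older, modern, vocab))
```

For the third phase, the columns of the kept words would be selected from a table that also includes the fresh texts, and the principal components analysis repeated on that reduced table.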