Centre for Literary and Linguistic Computing, Research

PCA Online

The Shakespeare Computational Stylistics Facility

The CSF is an initiative of the Australian e-Humanities Network. It was created by the Centre for Literary and Linguistic Computing at the University of Newcastle, NSW, Australia. The project was funded by the Australian Research Council through a Learned Academies Special Projects grant. Design and programming are by Russell Whipp.

Introduction

The Computational Stylistics Facility presents a set of Shakespeare play texts with a ready-made apparatus for computational-stylistics exploration. Within its parameters, users can define any number of variations on what is analysed and how. The system is designed for users with no experience in computational stylistics and is laid out to work as intuitively as possible.

The texts can be analysed as whole plays, as blocks (sequential segments of plays), or as character parts. Word-variables for the analyses can be typed or pasted in, or the system can calculate the 20, 50, or 100 most common words of the whole set or of the sub-set of texts you are using. The best way to start is with a simple walk-through. Four are offered below.
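
As a rough illustration of what "the N most common words of a set of texts" means, here is a minimal Python sketch. It is not the CSF's own code: the tokenising rule (lower-case runs of letters and apostrophes), the function names, and the three sample quotations are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and return its word tokens.

    Assumption: a 'word' is any run of letters and apostrophes.
    The CSF may well tokenise differently.
    """
    return re.findall(r"[a-z']+", text.lower())


def most_common_words(texts, n):
    """Return the n most frequent word forms across all texts, commonest first."""
    totals = Counter()
    for text in texts:
        totals.update(tokenize(text))
    return [word for word, _ in totals.most_common(n)]


# Hypothetical miniature "text set" standing in for whole plays.
sample_texts = [
    "To be, or not to be, that is the question.",
    "The lady doth protest too much, methinks.",
    "All the world's a stage, and all the men and women merely players.",
]

top_words = most_common_words(sample_texts, 3)
print(top_words)
```

In a real run, n would be 20, 50, or 100, and the texts would be whole plays, blocks, or character parts.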

Fuller instructions are in the CSF manual, which you can download here. You may also like to consult a recent poster presentation on the CSF.

CSF Walk-throughs

(Note that when navigating back to re-run any analysis, you need to "Remove all" of your play or character selections, and "Clear" your word selections in order to start afresh.)

  1. Exploring broad differences between the three genres
  2. Exploring differences between the three Falstaffs of Henry IV Part 1, Henry IV Part 2 and The Merry Wives of Windsor
  3. Exploring larger characters' use of the various forms of the second person pronoun
  4. Exploring the consistency in the contrast between some representative tragedies and comedies of Shakespeare's middle period

Start the CSF applet

Developed at the Centre for Literary and Linguistic Computing, University of Newcastle, Australia
Director: Hugh Craig
Software developers: R Whipp, Michael Ralston

Introduction

What is the Intelligent Archive?

The Intelligent Archive (IA) is a Java-based program used for text analysis within the University of Newcastle's Centre for Literary and Linguistic Computing (CLLC). Centre researchers use the software in different ways, depending on the aspect of text analysis they focus on. Professor Hugh Craig is the chief architect of the IA's development; much of the core functionality required for Prof. Craig's work also applies to others working within the CLLC.

The typical CLLC project involves preparing a set of texts for computational stylistics operations, with the ultimate purpose of determining the authorship of a disputed literary work, or analysing the style of a work or group of works. The IA serves these projects by organising sets of texts and making word counts, which can be exported for analysis in an external spreadsheet or statistics program. It is an interface to an archive of texts and incorporates a range of counting functions that the user can configure, hence the name 'intelligent archive'. While most text-processing programs focus on more linguistic outputs, such as concordances or lists of the commonest collocates of a given word, the IA's primary function is more statistical, centred on producing frequency counts of words.

Versions

The software is available in two versions, Budgerigar and Galah. Both provide the following core facilities:

  • Management of individual texts of different formats within a virtual library or repository
  • Management of text sets, which are user-created groups of these texts
  • Word frequency analysis on individual texts, tagged sections within texts, text
    sets, contiguous block segments of a specified size within texts, etc.
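
The block-based counting in the last item can be pictured with a short Python sketch. This is an illustration only, not the IA's Java implementation: the function names, the choice to drop a trailing remainder shorter than the block size, and the sample sentence are assumptions.

```python
from collections import Counter


def split_into_blocks(tokens, block_size):
    """Split a token list into contiguous blocks of exactly block_size words.

    Assumption: a trailing remainder shorter than block_size is dropped.
    """
    full_blocks = len(tokens) // block_size
    return [tokens[i * block_size:(i + 1) * block_size] for i in range(full_blocks)]


def relative_frequencies(tokens, words):
    """Return each listed word's share of the tokens, as a percentage."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: 100.0 * counts[word] / total for word in words}


# Hypothetical eleven-word "text", already tokenised.
tokens = "the cat and the dog and the bird saw the fox".split()

blocks = split_into_blocks(tokens, 4)
table = [relative_frequencies(block, ["the", "and"]) for block in blocks]
for row in table:
    print(row)
```

Each row of `table` is one block's frequency profile, which is the kind of output that can be exported to a spreadsheet or statistics program.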

Intelligent Archive Galah also includes functionality for 'experiments', namely Jensen-Shannon Divergence and Burrows' Zeta (incorporating Burrows' Iota) for both single words and word pairs.  Documentation is available for IA Budgerigar but not yet for IA Galah.
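
For readers unfamiliar with Jensen-Shannon Divergence, the following Python sketch computes it between the word distributions of two texts. How IA Galah implements the measure is not described here, so this is the textbook definition only; the function name, the base-2 logarithm, and the assumption that both token lists are non-empty are choices made for the example. Burrows' Zeta and Iota are not covered.

```python
import math
from collections import Counter


def jensen_shannon_divergence(tokens_a, tokens_b):
    """Jensen-Shannon divergence between two texts' word distributions.

    Uses base-2 logarithms, so the result lies between 0 (identical
    distributions) and 1 (no words in common).
    Assumption: both token lists are non-empty.
    """
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    divergence = 0.0
    for word in set(counts_a) | set(counts_b):
        p = counts_a[word] / total_a      # word's proportion in text A
        q = counts_b[word] / total_b      # word's proportion in text B
        m = (p + q) / 2                   # proportion in the 50/50 mixture
        if p > 0:
            divergence += 0.5 * p * math.log2(p / m)
        if q > 0:
            divergence += 0.5 * q * math.log2(q / m)
    return divergence


# Hypothetical token lists.
text_a = "thou art the man and thou art king".split()
text_b = "you are the man and you are king".split()
jsd_ab = jensen_shannon_divergence(text_a, text_b)
print(round(jsd_ab, 3))
```

The measure is symmetric, so swapping the two texts gives the same value.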

System Requirements

The Intelligent Archive is written for the Java platform. It can therefore run on any operating system that supports Java, and it requires a Java Runtime Environment to be installed. This can be downloaded here free of charge.

The specification of the computer needed depends on which features of the software you wish to use. The core functionality requires only a very basic system with at least 512MB of memory. The software does not require a fast CPU, although it will produce results more quickly with one. The software does not currently benefit from multiple CPUs or CPU cores.

The software itself uses less than 1MB of disk space. You will also require enough disk space to store all texts added to the text repository.

Sketch of Research

Statistical analysis has long been employed in the study of literary texts and has tended, for obvious reasons, to concentrate on the most common or the most uncommon of the linguistic phenomena they display. The discovery, by Burrows, on which much of our work rests is that the frequency-patterns of all the most common words, whatever they may be, are so distinctive, stable, and closely interlocked that they reveal more when they are examined together than when any of them is examined in isolation.

After ten years' work, we are able to form useful inferences, at an extremely high level of probability, from almost any comparative study of two or more sets of texts and, upon that basis, to predict how further texts will behave. Such comparisons may reflect change over time or differences of literary form within the writings of a single author; characteristic differences between two or more authors; and larger "class-differences" between different groups of authors. These last include consistent and intelligible differences between eighteenth-, nineteenth-, and twentieth-century authors; between male and female authors; and between Australian and non-Australian authors. In a related line of inquiry, we find that it is possible to detect systematic differences between those texts where one author revises the work of another and those where, starting with a clean sheet, one author imitates the work of another.

Our usual procedure is first to establish frequency tables for the thirty, fifty, or hundred most common words, whatever they may be, of a given set of texts; to extract a correlation matrix; and to subject the matrix to principal components analysis. If there are consistent resemblances and differences between some of the texts and others, the first two or three principal components of the correlation matrix allow the texts to form into clusters. In a second phase of the operation, the frequency tables are subjected to distribution tests (like Student's t-test and the Mann-Whitney test) to determine just which words play the greatest part in separating the texts in the manner previously observed. The relationship between the clustering of the texts and the patterning of those words that show statistically significant differences between the clusters of texts thus makes it possible to consider what linguistic factors are at work. In a third phase, those words that satisfied the distribution tests are used in a fresh analysis of the original set of texts and of texts not hitherto considered: if the previously formed clusters reflect a true difference between populations, the texts now added should take predictable positions.
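
The first phase (frequency table, correlation matrix, principal components) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Centre's actual tooling: the frequency figures are invented, words are treated as the variables and texts as the observations, no word's frequency is constant across texts, and a simple power iteration stands in for a full eigen-decomposition so that only the first principal component is recovered.

```python
import math


def standardize(column):
    """Convert one word's frequencies to z-scores (sample standard deviation)."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / (n - 1))
    return [(x - mean) / sd for x in column]


def correlation_matrix(table):
    """table[t][w] = frequency of word w in text t.

    Returns the word-by-word Pearson correlation matrix together with
    the standardized columns (one list of z-scores per word).
    """
    n_texts = len(table)
    columns = [standardize(list(col)) for col in zip(*table)]
    k = len(columns)
    corr = [[sum(a * b for a, b in zip(columns[i], columns[j])) / (n_texts - 1)
             for j in range(k)] for i in range(k)]
    return corr, columns


def first_component(corr, iterations=500):
    """Dominant eigenvalue and eigenvector of corr, found by power iteration."""
    k = len(corr)
    v = [float(i + 1) for i in range(k)]  # uneven start avoids symmetric dead ends
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(corr[i][j] * v[j] for j in range(k)) for i in range(k))
    return eigenvalue, v


# Invented frequencies (per cent) of three words in four texts.
table = [
    [3.0, 1.0, 2.0],
    [3.2, 0.9, 2.1],
    [1.0, 3.0, 2.2],
    [1.1, 3.1, 1.9],
]

corr, columns = correlation_matrix(table)
eigenvalue, loadings = first_component(corr)

# Score of each text on the first principal component.
scores = [sum(columns[w][t] * loadings[w] for w in range(len(loadings)))
          for t in range(len(table))]
print(scores)
```

With these invented figures the first two texts receive scores of one sign and the last two scores of the other, which is the "clustering" the paragraph describes. The sign itself is arbitrary.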

Figure 1 shows a good but not exceptional result. In the first of the phases sketched above, forty authorial sets of texts (each set consisting of at least three texts, all of them first-person retrospective narratives of a kind known in the eighteenth century as a "history") were compared. They formed two clusters in which the twenty eighteenth-century writers stood far from the twenty writers of our own century. The hundred most common words were then subjected to the two distribution tests in an attempt to establish which words most influenced this separation. Thirty-eight words satisfied both tests at significance levels stricter than 0.001, indicating probabilities (for each word) of less than one chance in a thousand that the two clusters did not come from different populations. The analysis was then repeated, with two changes: only the thirty-eight "strongly discriminating" words were used; and twelve fresh authorial sets were added. Six of these are authentic eighteenth- and twentieth-century work by Thomas Chatterton, on the one hand, and by David Lodge, John Barth, Angela Carter, J M Coetzee, and Erica Jong on the other. The remaining six sets (by the four novelists last-named and by Peter Ackroyd and Georgette Heyer) all represent texts in which modern writers attempt to imitate the work of their eighteenth- or early nineteenth-century forebears. As Figure 1 shows, the two main clusters are quite discrete and the test cases behave according to prediction: the authentic cases lie within or close beside the appropriate clusters; and, though they differ considerably from each other, none of the attempted imitations penetrates the target group. The value of specimens like this, where the truth is already known, is in testing procedures for use in cases where it is not.
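
The two distribution tests applied to each word can be sketched as follows. The Python code is a simplified illustration, not the procedure actually used: the frequencies are invented, the Mann-Whitney p-value uses the normal approximation without a tie correction, and Student's t is returned as a statistic only (a p-value would need the t distribution, which a statistics package supplies).

```python
import math


def mann_whitney_u(group_a, group_b):
    """Mann-Whitney U for group_a, with an approximate two-sided p-value.

    Assumption: normal approximation, no tie correction.
    """
    n_a, n_b = len(group_a), len(group_b)
    # U counts the pairs in which the group_a value is larger (ties count half).
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0
            for a in group_a for b in group_b)
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u, p_value


def student_t(group_a, group_b):
    """Student's t statistic for two independent samples (pooled variance)."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = sum(group_a) / n_a, sum(group_b) / n_b
    ss_a = sum((x - mean_a) ** 2 for x in group_a)
    ss_b = sum((x - mean_b) ** 2 for x in group_b)
    pooled_var = (ss_a + ss_b) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var * (1 / n_a + 1 / n_b))


# Invented frequencies (per cent) of one word in two clusters of texts.
cluster_a = [2.9, 3.1, 3.4, 3.0, 3.3, 3.2]
cluster_b = [1.9, 2.2, 2.0, 2.4, 2.1, 2.3]

u, p_value = mann_whitney_u(cluster_a, cluster_b)
t = student_t(cluster_a, cluster_b)
print(u, p_value, t)
```

A word would count as "strongly discriminating" in the sense above only if both tests placed it beyond the chosen threshold; repeating the calculation for each of the hundred words yields the reduced word list.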

Cambridge Platonists
Fiction from 1700
John Curtin and the Westralian
John Locke and Thomas Sydenham
Letters 1570-1680
Mary Fortune and James Skipp Borlase
Renaissance Plays and Poems
Restoration Verse
Samuel Beckett and James Joyce
Victorian Periodicals
Virginia Woolf Letters