Pablo Moscato: From bioinformatics to Shakespeare
Why does an eminent computer scientist team up with a Renaissance literature expert? To design a study using information theory and computer science methods and determine the authorship of plays and poems where that authorship is unknown or disputed, including some of the works of William Shakespeare.
The 2014 study focused on applications in language-based research using an analytical approach based on the idea that authors develop and eventually evolve a highly individual literary style. Analysing variations in the use of words and phrases and their observed frequencies in different plays and poems can therefore provide a strong basis for classifying authorship.
Computer analytics expertise was provided by a team from the University's Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine (CIBM), including the Centre's Director, Professor Pablo Moscato, Dr Ahmed Shamsul Arefin, Dr Renato Vimieiro and Dr Carlos Riveros. The team's approach, based on the Jensen-Shannon divergence and a graph partitioning algorithm, could also be applied to large data sets in domains other than literary authorship. One example is patient classification from gene co-expression data in a biological setting, a more familiar area of research for the team.
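For readers unfamiliar with the Jensen-Shannon divergence, the following is a minimal sketch of the standard textbook definition, computed in bits between two word-frequency profiles. It is not the team's code; the function names `normalise`, `kl_divergence` and `jensen_shannon_divergence` are illustrative assumptions.

```python
import math

def normalise(counts):
    """Turn raw word counts into relative frequencies that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon_divergence(p, q):
    """Symmetric divergence: average KL divergence of p and q from their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

With base-2 logarithms the value lies between 0 (identical profiles) and 1 (profiles that share no words), which makes it convenient as a dissimilarity score between two works.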
The team conceived, designed and performed the experiments, contributed analysis tools, analysed the data and wrote the research paper in collaboration with Professor Hugh Craig, Director of the University's Centre for Literary and Linguistic Computing. Professor Craig has been an advocate of computer-assisted analysis of language in literature since the controversial field began to emerge in the 1980s. According to Craig, the field is still controversial because people in the literary world don't like or trust numbers and don't understand how and why counting things can work in a literary context. His work, however, has led to several breakthrough findings, particularly in regard to Shakespearean era works.
In this cross-disciplinary study, the team analysed 256 plays and poems from the 16th and 17th centuries, comprising works by 60 authors plus a separate category of works of unknown authorship. The works were divided into five clusters: four contained works by a number of authors, and the fifth contained works by a single author. With the aim of revealing authorial and genre affinities, the Intelligent Archive software tool generated a set of 66,907 unique words from the plays and poems and computed the frequency of each word in each work.
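The word-by-work frequency table described above can be illustrated with a small sketch. This is not the Intelligent Archive software; the tokenisation rule and the names `tokenize` and `frequency_matrix` are assumptions made for illustration only.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into simple word tokens (an assumed rule)."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_matrix(works):
    """Build a words x works count matrix from a {title: text} dictionary.

    Returns the sorted vocabulary (row labels), the titles (column labels)
    and the matrix as a list of rows.
    """
    counts = {title: Counter(tokenize(text)) for title, text in works.items()}
    vocabulary = sorted(set().union(*counts.values()))
    titles = list(works)
    matrix = [[counts[title][word] for title in titles] for word in vocabulary]
    return vocabulary, titles, matrix
```

Applied to the study's corpus, a table of this shape would have one row for each of the 66,907 unique words and one column for each of the 256 works.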
Researchers then applied an unsupervised graph-based clustering method to a 66,907 × 256 matrix of observed frequency values, with the objective of clustering the 256 plays and poems. The data set was generated by the Intelligent Archive software; the CIBM team applied the MST-kNN and kNN Clique methods to generate the clusters and subsequently analysed the data.
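The general idea behind MST-kNN clustering can be sketched as follows: build a minimum spanning tree over all works, build a k-nearest-neighbour graph, keep only the edges present in both, and read the connected components as clusters. This is a simplified single-pass version with a user-chosen `k`, assuming a precomputed symmetric distance matrix; it is not the team's implementation, and all function names are illustrative.

```python
def mst_edges(dist):
    """Prim's algorithm on a complete graph given as a symmetric distance matrix."""
    n = len(dist)
    in_tree = {0}
    edges = set()
    while len(in_tree) < n:
        # Cheapest edge joining the current tree to a vertex outside it.
        u, v = min(
            ((i, j) for i in in_tree for j in range(n) if j not in in_tree),
            key=lambda e: dist[e[0]][e[1]],
        )
        edges.add((min(u, v), max(u, v)))
        in_tree.add(v)
    return edges

def knn_edges(dist, k):
    """Undirected edges linking each vertex to its k nearest neighbours."""
    n = len(dist)
    edges = set()
    for i in range(n):
        nearest = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for j in nearest:
            edges.add((min(i, j), max(i, j)))
    return edges

def components(n, edges):
    """Connected components via a small union-find structure."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

def mst_knn_clusters(dist, k):
    """Keep MST edges that are also kNN edges; the components are the clusters."""
    kept = mst_edges(dist) & knn_edges(dist, k)
    return components(len(dist), kept)
```

In a setting like the study's, the distance matrix would hold a pairwise dissimilarity (such as one derived from the Jensen-Shannon divergence) between the word-frequency profiles of every pair of works.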
Based on similarity to core subgroups and the number of connections to neighbouring authors in the identified clusters, the investigators assigned a plausible authorship to each of the 17 uncertain works included in the study.
Amongst other conclusions, the study found that it is 'likely' that Shakespeare collaborated with others to pen Edward III, and it is possible that Marlowe contributed to Shakespeare's Henry VI Parts 1 and 2, both of which are early plays in closely related genres with a likely overlap in collaborative authorship. The poem 'Funeral Elegy', attributed in the 1990s to Shakespeare, appears together in this study with several other poems by John Ford. The poem 'A Lover's Complaint' links in the clustering to Shakespeare's play 'Cymbeline', supporting the hypothesis that he may have written them at the same time.
Computational analysis can be applied not only to literary works but also to identifying the idiosyncrasies of any person's language. It enables word patterns to be accurately detected and can be used to analyse how language changes with ageing. The techniques can also be applied to investigate how people's language changes with the onset of Alzheimer's disease, which may lead to early detection of the disease, one of Professor Moscato's important, long-term areas of bioinformatics interest.
The bioinformatics team uses the same computational analysis methods to conduct rapid preliminary analysis of large biomarker datasets. The methods have helped identify potentially mislabelled samples, outliers of a major class of disease, and other potential pitfalls that are identifiable and avoidable during early processing of the data.
One of the promises of this method is its 'scalability'. Moscato's team has already shown that this clustering analysis, powered by supercomputing systems in his laboratory, can handle datasets of online consumer behaviour information with millions of samples. The team is now keen to explore the use of this method in other fields such as the computational and social sciences and applications in engineering and business intelligence.
Arefin, AS, Vimieiro, R, Riveros, C, Craig, H & Moscato, P 2014 'An information theoretic clustering approach for unveiling authorship affinities in Shakespearean era plays and poems', PLoS One, October 27. http://www.plosone.org/article/authors/info%3Adoi%2F10.1371%2Fjournal.pone.0111445