Stylometric Analysis


What is it?


Stylometric analysis involves the use of statistical methods to attempt to ascribe authorship to disputed texts. Let's use a simple example: we have a text A that is indisputably by author A, a text B that is indisputably by author B, and a disputed text C that we know to be written either by A or by B.


Then a stylometric analysis test will take the three texts and (ideally) inform us whether text C was written by author A or by author B. How does it work? Well, let's reduce the problem to its simplest form.


Suppose that the test places texts A and B on the ends of a line marked by gradations from 1 to 10.


1    2    3    4    5    6    7    8    9    10


A                                             B


Now we carry out the same test on text C – where will it go? If the test places it at position 1, then we might ascribe it to author A, and if it places it at position 10, then we might ascribe it to author B. But what if it is placed elsewhere? Is it still by author A if it is placed at position 2? Or still by author B if it is placed at position 9? What if it is placed at position 5?


Let’s suppose that we also examined 50 other texts by author A and 50 by author B, and that all texts by author A fell at positions 1 or 2 while all those by author B fell at 9 or 10 (the ideal case). Then we could use this data as a strong indicator of authorship – unless, of course, the text was actually written by someone else entirely.
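The decision rule sketched above can be written as a tiny nearest-centroid classifier. This is only a sketch: the positions are illustrative values on the hypothetical 1-to-10 scale, not the output of any real stylometric measure.

```python
def classify(position_c, positions_a, positions_b):
    """Assign text C to whichever author's texts lie closer, on average,
    on the one-dimensional scale."""
    mean_a = sum(positions_a) / len(positions_a)
    mean_b = sum(positions_b) / len(positions_b)
    return "A" if abs(position_c - mean_a) < abs(position_c - mean_b) else "B"

# 50 texts by A cluster at positions 1-2, 50 by B at 9-10 (the ideal case)
positions_a = [1, 2] * 25
positions_b = [9, 10] * 25
print(classify(1.5, positions_a, positions_b))  # A
print(classify(9.2, positions_a, positions_b))  # B
```

Note that a text at position 5 still gets forced into one of the two classes – exactly the weakness the next paragraphs discuss.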


The difficulty arises when authors A and B are very close together on the scale that we have chosen or, quite simply, when we introduce more and more authors and they start to populate the linear scale.


In their now famous study of The Federalist Papers, Mosteller and Wallace (1964) complained about the fact that James Madison and Alexander Hamilton, who wrote most of The Federalist Papers, had average sentence lengths of 34.55 and 34.59 words respectively, with standard deviations of 19.2 and 20.3 words – hopeless for discriminating between the two authors.
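To see just how hopeless, we can put Mosteller and Wallace's figures into a standardised difference (Cohen's d with a pooled standard deviation) – a rough sketch of the comparison, not their actual method:

```python
# Sentence-length figures reported by Mosteller and Wallace (1964)
mean_madison, sd_madison = 34.55, 19.2
mean_hamilton, sd_hamilton = 34.59, 20.3

# Standardised difference between the two means (pooled-SD Cohen's d)
pooled_sd = ((sd_madison**2 + sd_hamilton**2) / 2) ** 0.5
d = abs(mean_hamilton - mean_madison) / pooled_sd
print(round(d, 4))  # roughly 0.002
```

A difference of about a five-hundredth of a standard deviation means the two sentence-length distributions overlap almost completely: the measure cannot separate the authors.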


There are two problems here:


1.      With a one-dimensional scale, we are forced to place every author onto that single scale, especially if we are searching for an unknown author.

2.      We don’t know how many other authors are lurking out there with the same, or a very similar, value of the measure we have chosen.


Let’s consider these problems in turn:


The one-dimensional scale – This is useful only if the measure discriminates between authors A and B – i.e. it spreads out the works of the two authors widely and reliably – and only if our disputed text C is known to have been written by either author A or author B and no other. Otherwise we are forced to measure and discriminate between our authors on a scale of two or more dimensions.


How many authors? – How do we know that our measure is accurate and capable of discriminating between different writers? One way of determining this is to carry out a comprehensive study using hundreds of different writers; if they are all neatly separated out, then you can, with some justification, call your measure successful.


So what kind of simple, linear measures are there for authorship?


  1. Sentence Length. This one is very simple to describe: the measure is the mean number of words per sentence used by the author. As a one-dimensional measure, it has all the pitfalls described above, plus some more. First of all, old texts: Shakespearean texts, for example, were edited, and the punctuation was very often changed by editors/compositors, so sentence boundaries can be unreliable. Modern works, particularly very recent ones, may have been influenced by word-processors. Furthermore, sentence length should never, ever be used on transcriptions of spoken text – people don’t speak full-stops.
  2. Word Length. Again a very simple measure that needs little explanation: count all the words, find out how long they are and produce a distribution. If you’re lucky, this may tell you what century the work was written in: 17th and 18th century texts have a mode of 4 letters, while modern texts have a mode of 3 letters. Why? Pronouns have become shorter (e.g. thee and thou are no longer used in standard English), and the definite article has become much more frequent – compare a concordance listing of Shakespeare with one of a modern text, for example. Why should one individual’s word length profile vary? Inevitably, word length profiles will be dominated by a few frequently occurring words.
    1. Word Frequency Graph of a Shakespeare Play
    2. Word Frequency Graph of a Contemporary Text
    3. 16th Century and 20th Century Texts Compared
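Both measures are straightforward to compute. The sketch below uses a naive sentence splitter (on “.”, “!” and “?”) and a toy sample text; real texts need proper tokenisation, for exactly the punctuation reasons noted above.

```python
import re
from collections import Counter

def mean_sentence_length(text):
    """Mean number of words per sentence, splitting naively on . ! ?"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return sum(len(s.split()) for s in sentences) / len(sentences)

def word_length_distribution(text):
    """Counter mapping word length (in letters) to frequency."""
    words = re.findall(r"[A-Za-z]+", text)
    return Counter(len(w) for w in words)

sample = "Thou art more lovely. Shall I compare thee to a summer day?"
print(mean_sentence_length(sample))         # 6.0
print(word_length_distribution(sample)[4])  # 3 four-letter words
```

Even this toy example shows why the measures are fragile: change one full stop and the mean sentence length changes, while the word-length profile is dominated by a handful of short, common words.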


So what can we learn from word length frequency profiles? Word distributions follow Zipf’s Law, which means that a few frequently occurring words will dominate the word frequency profile. Word length frequency is a rather coarse-grained way of looking at the frequency of words within texts.
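Zipf's Law says that a word's frequency is roughly inversely proportional to its frequency rank, so rank × frequency should be roughly constant for the top words. A minimal sketch on a toy sentence (a real demonstration needs a full-length text):

```python
from collections import Counter

def rank_frequency(text):
    """Return (rank, frequency) pairs, most frequent word first."""
    counts = Counter(text.lower().split())
    ordered = sorted(counts.items(), key=lambda kv: -kv[1])
    return [(rank, freq) for rank, (_, freq) in enumerate(ordered, start=1)]

# Under Zipf's Law, rank * frequency stays roughly constant.
text = "the cat sat on the mat and the dog sat on the rug"
for rank, freq in rank_frequency(text)[:3]:
    print(rank, freq, rank * freq)
```

Even here “the” towers over everything else, which is exactly why word-length (and, below, letter-frequency) profiles end up dominated by a few very common words.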

  3. Letter Frequency. Letter frequency involves counting the frequencies with which letters of the alphabet appear in a text and using this, in a similar way to word length frequency, as a way of comparing texts. Not surprisingly, in English we are likely to find the highest values at “e” and “t”. Once again it is worth considering why letter frequencies occur the way they do: they are likely to be dominated by high-frequency words and the letters they contain – Zipf’s Law applies once again. So the same argument can be applied to letter frequencies – why look at measures like this if they are determined by the words themselves?
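Computing a letter-frequency profile is equally simple; the sketch below normalises counts to relative frequencies (the toy quotation is too short to show the full English profile, but “t” already dominates it):

```python
from collections import Counter

def letter_frequencies(text):
    """Relative frequency of each letter, ignoring case and non-letters."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    return {ch: counts[ch] / total for ch in counts}

freqs = letter_frequencies("To be, or not to be, that is the question.")
print(max(freqs, key=freqs.get))  # t
```

On a sizeable English text we would expect “e” and “t” at the top – which, as argued above, is really just the high-frequency words showing through.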


The next thing to do seems to be to look at words.


Word Based Stylometric Analysis


© Peter W.H. Smith, March 2006