Stylometric analysis uses statistical methods to attempt to ascribe authorship to disputed texts. Let's take a simple example: we have a text A that is indisputably by author A, a text B that is indisputably by author B, and a disputed text C that we know to have been written either by A or by B.
A stylometric test will then take the three texts and (ideally) tell us whether text C was written by author A or by author B. How does it work? Well, let's reduce the problem to its simplest form.
Suppose that the test places text A at position 1 and text B at position 10, the two ends of a line marked by gradations from 1 to 10.
1 2 3 4 5 6 7 8 9 10
Now we carry out the same test on text C – where will it go? If the test also places it at position 1, then we might ascribe it to author A, and if it places it at position 10, then we might ascribe it to author B. But what if it is placed elsewhere? Is it still by author A if it is placed at position 2? Is it still by author B if it is placed at position 9? What if it is placed at position 5?
Let's suppose that we also examined 50 other texts by author A and 50 by author B, and that all texts by author A fell at positions 1 or 2 while all those by author B fell at 9 or 10 – the ideal case. Then we could use this data as a strong indicator of authorship – unless, of course, there is the possibility that the text was actually written by someone else entirely.
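The placement idea above can be sketched in a few lines of code. This is a minimal illustration, not any particular published method: the measure used here (average word length) and the sample texts are purely hypothetical stand-ins for whatever single numeric feature a real test might use.

```python
# Sketch of a one-dimensional stylometric test.
# The measure (average word length) is illustrative only; any single
# numeric feature of a text could play the same role.

def avg_word_length(text):
    """Average length of the whitespace-separated words in a text."""
    words = text.split()
    return sum(len(w) for w in words) / len(words)

def place_on_scale(value, low, high):
    """Map a raw measure onto the 1-10 scale, clamped at the ends."""
    position = 1 + 9 * (value - low) / (high - low)
    return max(1.0, min(10.0, position))

# Hypothetical sample texts standing in for real corpora.
text_a = "a aa a aa a aa"          # short words -> low measure (author A)
text_b = "abcdefgh abcdefghij"     # long words  -> high measure (author B)
text_c = "aa abc aa abc"           # the disputed text

low = avg_word_length(text_a)      # defines position 1
high = avg_word_length(text_b)     # defines position 10

pos_c = place_on_scale(avg_word_length(text_c), low, high)
# pos_c lies between 1 and 10; a value near 1 suggests author A,
# a value near 10 suggests author B.
```

A value of `pos_c` near either end of the scale mirrors the easy cases discussed above; the hard cases are exactly the middling positions.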
The difficulty arises when authors A and B lie very close together on the scale that we have chosen, or, quite simply, when we introduce more and more authors and they start to populate the linear scale.
In their now famous study of The Federalist Papers, Mosteller and Wallace (1964) complained that James Madison and Alexander Hamilton, who wrote most of The Federalist Papers, had average sentence lengths of 34.55 and 34.59 words respectively, with standard deviations of 19.2 and 20.3 words – hopeless for discriminating between the two authors.
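The figures above make the problem concrete: the gap between the two means is tiny compared with the spread of either author's sentence lengths. A quick back-of-the-envelope check (using only the means and standard deviations quoted from Mosteller and Wallace) shows just how small the separation is:

```python
# Why mean sentence length fails here: the gap between the means,
# expressed in units of the authors' typical spread, is negligible.
import statistics

# (mean, standard deviation) of sentence length in words,
# as reported by Mosteller and Wallace (1964).
madison = (34.55, 19.2)
hamilton = (34.59, 20.3)

pooled_sd = statistics.mean([madison[1], hamilton[1]])
separation = abs(madison[0] - hamilton[0]) / pooled_sd
# separation is roughly 0.002 standard deviations -- the two
# distributions overlap almost completely, so the measure cannot
# discriminate between the two authors.
```

By contrast, a usable measure would need the means to differ by a substantial fraction of a standard deviation.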
There are two problems here:
1. With a one-dimensional scale, we are forced to place every author onto that scale, especially if we are searching for an unknown author.
2. We don't know how many other authors are out there lurking who have the same, or a very similar, value on the measure we have chosen.
Let’s consider these problems in turn:
The one-dimensional scale – This is useful only if the measure discriminates between authors A and B – i.e. it spreads out the works of authors A and B widely and reliably – and only if our disputed text C is known to have been written by either author A or author B and no other. Otherwise we are forced to measure and discriminate between our authors on a scale of two or more dimensions.
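Moving to two or more dimensions can be sketched as follows. Each text becomes a point in feature space, and the disputed text is attributed to whichever known author's point lies nearer. The two features used here (average word length and the fraction of long words) and the sample texts are illustrative assumptions, not a measure proposed in the text:

```python
# Sketch: discrimination in two dimensions rather than one.
# Each text maps to a point (average word length, fraction of long words);
# the feature choice is illustrative only.
import math

def features(text):
    """Map a text to a two-dimensional feature point."""
    words = text.split()
    avg_len = sum(len(w) for w in words) / len(words)
    long_frac = sum(1 for w in words if len(w) >= 6) / len(words)
    return (avg_len, long_frac)

def distance(p, q):
    """Euclidean distance between two feature points."""
    return math.dist(p, q)

# Hypothetical sample texts.
text_a = "the cat sat on the mat and the dog ran off"
text_b = "stylometric analysis involves statistical methods ascribing authorship"
text_c = "authorship attribution involves comparing statistical profiles"

c = features(text_c)
# Attribute C to whichever known author's point it lies closer to.
closer_to_a = distance(c, features(text_a)) < distance(c, features(text_b))
```

Two authors who coincide on one axis may still separate cleanly on the other, which is exactly what the one-dimensional scale cannot offer.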
How many authors? – How do we know that our measure is accurate and capable of discriminating between different writers? One way of determining this is to carry out a comprehensive study using hundreds of different writers; if they are all neatly separated out, then you can with some justification call your measure successful.
So what kind of simple, linear measures are there for authorship?
So what can we learn from word length frequency profiles? Word distributions follow Zipf's Law, which means that a few frequently occurring words will dominate the word frequency profile. Word length frequency is a rather coarse-grained way of looking at the frequency of words within texts.
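A word length frequency profile is simple to compute, and computing one shows why it is coarse-grained: the short, high-frequency function words that Zipf's Law predicts ("the", "of", "and", ...) all pile into the same few length bins. The sample sentence below is an illustrative assumption:

```python
# Sketch: a word length frequency profile. Counting how many words of
# each length occur is a coarse-grained summary of a text, dominated
# by short, high-frequency function words.
from collections import Counter

def word_length_profile(text):
    """Map each word length to its relative frequency in the text."""
    words = text.lower().split()
    counts = Counter(len(w) for w in words)
    total = len(words)
    return {length: counts[length] / total for length in sorted(counts)}

sample = "the quick brown fox jumps over the lazy dog and the cat"
profile = word_length_profile(sample)
# The three occurrences of "the" and other short words all fall into
# the length-3 bin, which dominates the profile.
```

Many quite different texts will share near-identical profiles, which is why the discussion moves on from word lengths to the words themselves.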
The next step, then, is to look at the words themselves.
© Peter W.H. Smith, March 2006