Stylometric analysis uses statistical methods to try to ascribe authorship to
disputed texts. Let’s use a simple example. We have a text A that is
indisputably by author A, a text B that is indisputably by author B, and a
disputed text C that we know was written by either A or B. A stylometric test
takes the three texts and (ideally) tells us whether text C was written by
author A or by author B. How does it work? Let’s reduce the problem to its
simplest form.
Suppose that the test
places texts A and B on the ends of a line marked by gradations from 1 to 10.
1   2   3   4   5   6   7   8   9   10
______________________________________
A                                    B
Now we carry out the same test on text C. Where will it go? If the test
places it at position 1, we might ascribe it to author A, and if it places it
at position 10, we might ascribe it to author B. But what if it is placed
elsewhere? Is it still by author A if it is placed at position 2? Is it still
by author B if it is placed at position 9? What if it is placed at position 5?
Let’s suppose that we also examined 50 other texts by author A and 50 by
author B. All texts by author A occurred at positions 1 or 2 and all by author
B occurred at 9 or 10 (the ideal case). Then we could use this data as a strong
indicator of authorship, unless, of course, there is a possibility that the
text was actually written by someone else entirely.
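As a rough illustration of the reasoning above, here is a minimal Python sketch. The function name attribute, the margin parameter and the scores are all hypothetical and are not part of any real stylometric test. It ascribes a text only when its position lies close to one author's observed range, and otherwise declines to decide.

```python
def attribute(score, scores_a, scores_b, margin=1.0):
    """Attribute a disputed text from its position on a one-dimensional scale.

    Returns "A" or "B" when the score lies within `margin` of exactly one
    author's observed range, and "undecided" otherwise.
    """
    near_a = min(scores_a) - margin <= score <= max(scores_a) + margin
    near_b = min(scores_b) - margin <= score <= max(scores_b) + margin
    if near_a and not near_b:
        return "A"
    if near_b and not near_a:
        return "B"
    return "undecided"


# Hypothetical positions for 50 known texts per author (the ideal case).
scores_a = [1, 2] * 25
scores_b = [9, 10] * 25

for position in (1, 2, 5, 9):
    print(position, attribute(position, scores_a, scores_b))
```

A text at position 5 comes back as "undecided", which is the honest answer when it could have been written by someone else altogether.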
The difficulty arises either when author A and author B are very close
together on the scale that we have chosen or, quite simply, when we introduce
more and more authors and they start to populate the linear scale.
In their now famous study of The Federalist Papers, Mosteller and Wallace
(1964) complained that Alexander Hamilton and James Madison, who wrote most of
The Federalist Papers, had average sentence lengths of 34.55 and 34.59 words
respectively, with standard deviations of 19.2 and 20.3 words: hopeless for
discriminating between the two authors.
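To see just how hopeless, we can express the gap between the two averages in units of the spread. The sketch below uses only the four figures quoted above. Pooling the two standard deviations with equal weight is a simplifying assumption of this example, not something taken from Mosteller and Wallace.

```python
from math import sqrt

# Average sentence lengths and standard deviations quoted above (in words).
mean_1, sd_1 = 34.55, 19.2
mean_2, sd_2 = 34.59, 20.3

# Assumption: pool the two spreads with equal weight.
pooled_sd = sqrt((sd_1 ** 2 + sd_2 ** 2) / 2)

# Difference between the averages, measured in pooled standard deviations.
effect_size = abs(mean_1 - mean_2) / pooled_sd

print(f"pooled standard deviation: {pooled_sd:.2f} words")
print(f"standardised difference:   {effect_size:.4f}")
```

The two averages differ by roughly 0.002 of a standard deviation, so the two authors sit practically on top of each other on this scale.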
There are two problems here:
1. With a one-dimensional scale, we are forced to place every author onto
that scale, especially if we are searching for an unknown author.
2. We don’t know how many other authors are out there lurking that have the
same or a very similar measure to the one chosen.
Let’s consider these problems in turn:
The one-dimensional scale – This is useful if the measure discriminates
between authors A and B (i.e. it spreads out the works of authors A and B
widely and reliably), and only if our disputed text C is known to have been
written by either author A or author B and no other. Otherwise we are forced
to consider measuring and discriminating between our authors on a scale of two
or more dimensions.
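A minimal sketch of what a second dimension can buy us, assuming two hypothetical measures per text. The numbers are invented for illustration: the two authors overlap completely on the first measure and are separated only by the second. The nearest-centroid rule used here is just one simple choice, not a recommended method.

```python
from math import dist
from statistics import mean


def centroid(points):
    """Mean position of a set of (x, y) measurements."""
    xs, ys = zip(*points)
    return (mean(xs), mean(ys))


def nearest_author(point, known):
    """Return the author whose centroid lies closest to `point`."""
    return min(known, key=lambda author: dist(point, centroid(known[author])))


# Hypothetical measurements: identical spread on the first measure,
# clear separation on the second.
known = {
    "A": [(5.0, 1.0), (5.5, 2.0), (4.5, 1.5)],
    "B": [(5.0, 9.0), (5.5, 8.0), (4.5, 9.5)],
}

print(nearest_author((5.0, 2.5), known))
```

Note that a rule like this always names one of the known authors, so it still relies on the disputed text having been written by one of them.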
How many authors? – How do we know that our measure is accurate
and is capable of discriminating between different writers? One way of
determining this is to carry out a comprehensive study using hundreds of
different writers; if they are all neatly separated out, then you can with
some justification call your measure successful.
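One crude way to check whether writers are "neatly separated out" is to look for any pair of authors whose observed ranges overlap on the scale. The sketch below assumes each author's texts have already been scored. The function name overlapping_pairs and the scores are hypothetical.

```python
from itertools import combinations


def overlapping_pairs(scores_by_author):
    """Return the pairs of authors whose score ranges overlap on the scale."""
    ranges = {a: (min(s), max(s)) for a, s in scores_by_author.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(ranges), 2)
        if ranges[a][0] <= ranges[b][1] and ranges[b][0] <= ranges[a][1]
    ]


# Hypothetical scores for three writers on a 1 to 10 scale.
scores = {"A": [1, 2, 2], "B": [9, 10, 9], "C": [2, 3, 3]}

print(overlapping_pairs(scores))
```

Here authors A and C collide at position 2, so the measure cannot tell them apart reliably. With hundreds of writers on a single scale, collisions of this kind become increasingly likely.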
So what kind of simple,
linear measures are there for authorship?
So what can we learn from word length frequency profiles? – Word
distributions follow Zipf’s Law, which means that a few frequently occurring
words will dominate the word frequency profile. Word length frequency is a
rather coarse-grained way of looking at the frequency of words within texts.
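For concreteness, a word length frequency profile can be computed along the following lines. This is a minimal sketch: the tokenising rule (runs of letters and apostrophes count as words) is a simplifying assumption, and the sample sentence is only an illustration.

```python
import re
from collections import Counter


def word_length_profile(text):
    """Relative frequency of each word length found in `text`."""
    # Assumption: a word is any run of letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", text)
    counts = Counter(len(word) for word in words)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {length: counts[length] / total for length in sorted(counts)}


sample = "It was the best of times, it was the worst of times"
profile = word_length_profile(sample)

for length, share in profile.items():
    print(length, round(share, 3))
```

Even in this tiny sample, two- and three-letter words make up two thirds of the profile, which shows how a handful of short, frequent words can dominate the measure.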
The next thing to do seems to be to look at words.
Word Based Stylometric Analysis
© Peter W.H. Smith, March 2006