 Fascinating! (My apologies for not snipping, but I don't know how quickly other people delete received emails, and wanted to preserve the details that so interested me.)
 Louann: I guess with new analysis methods (how were those methods tested,
btw, John?) you can get new information from old data.

John: This isn't simple stuff, but, very nutshelled, taking the methods
devised and refined by MacDonald P. Jackson, Hugh Craig, Arthur Kinney, and others --

The old subjective detection of echoes and similarities, which underlay
attribution arguments, had two statistical problems : a critic could only
find an echo in a work they had already read (predetermination of sample),
and the samples compared were way too small and often radically unequal.

What changes the game is having -all- surviving printed drama, and all
surviving printed literature, in searchable e-form (and yes, there were
editorial protocols about transcription, generating consistency).

Using the accepted clean canon for author X (for Shax, excluding known
collaborations and anything dubious for whatever reason, that means 27
plays) you first generate lists of 500 words that are -more- likely to be
in that author's usage as against all other drama/lit, and similarly 500
words that are -less- likely. Frex, Shax uses 'yes' less often than 'yea'
or 'aye/I', whereas Fletcher uses 'yes' much more normatively ; and Shax
was keen on 'gentle', 'answer' and 'beseech', but did not like either
'sure', 'brave', or the plural 'hopes' (plenty of 'hope' but very few
'hopes').

Second, you create a test of function words (prepositions and conjunctions,
mostly) that one does not usually notice as patterned in reading, or
writing, but that exhibit consistent individual characteristics -- the
positioning, syntactically, of 'which' is one good one ; also use of 'if',
'and' as 'and' and 'and' as 'if' (!) &c. An author's syntactical and
grammatical fingerprint, as it were.
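
To make the function-word idea a bit more concrete, here is a minimal Python sketch that profiles a text by how often it uses a handful of function words. It is only an illustration, not the method in the book: the word list, the `tokenize` helper and the per-1000-token rate are all my own assumptions, and plain counting like this cannot see syntactic position or tell 'and' as 'and' from 'and' as 'if'.

```python
import re
from collections import Counter

# Illustrative subset only (an assumption); a real test would use a longer list.
FUNCTION_WORDS = ["and", "if", "which", "the", "of", "to"]

def tokenize(text):
    """Lowercase the text and split it into word tokens (apostrophes kept)."""
    return re.findall(r"[a-z']+", text.lower())

def function_word_rates(text, words=FUNCTION_WORDS):
    """Return each listed function word's rate per 1000 tokens of the text."""
    tokens = tokenize(text)
    total = len(tokens)
    if total == 0:
        return {w: 0.0 for w in words}
    counts = Counter(tokens)
    return {w: 1000.0 * counts[w] / total for w in words}
```

Comparing such rate profiles across many equal-sized samples is what lets habits that nobody notices in reading show up as a pattern.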

So, suppose you want to test Hand D of *Sir Thomas More*. It has about 1200
words, so you divide the canons of possible authors (Shax, Munday, Jonson,
plus) into 1200-word chunks, and to each you apply a) a test regarding the
presence/absence of words on the two lists of 500, generating a numerical
score ; and b) the function words test, ditto. With two values for each
1200-word chunk, you plot a scattergraph, which produces a rough cluster
for each author, and the clusters for the most part don't overlap. Then you do the
same with the Hand D chunk -- and you find that it lies smack in the middle
of the Shax cluster and nowhere near the middle of anyone else's cluster.
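
The chunking and the marker-word score can be sketched in a few lines of Python. This is a simplified stand-in under stated assumptions: the names `chunks` and `marker_score` are hypothetical, the score (share of preferred words present minus share of avoided words present) is one plausible way to get a single number and not necessarily the published formula, and both word sets are assumed to be non-empty.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (apostrophes kept)."""
    return re.findall(r"[a-z']+", text.lower())

def chunks(tokens, size=1200):
    """Split tokens into consecutive full-size segments, dropping a short remainder."""
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]

def marker_score(segment, preferred, avoided):
    """Share of preferred word types present minus share of avoided types present.

    Assumes `preferred` and `avoided` are non-empty sets of lowercase words.
    """
    present = set(segment)
    share_preferred = len(present & preferred) / len(preferred)
    share_avoided = len(present & avoided) / len(avoided)
    return share_preferred - share_avoided
```

Each chunk's marker score would be one coordinate of its point on the scattergraph, with a function-word score as the other ; the disputed passage is scored in exactly the same way and plotted alongside.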

There are other things one can do also. Plotting the two values of the
500-word lists can also produce sharp distinctions : with 27 plays by Shax
and 85 by his contemporaries divided into 1200-word segments and plotted
purely on preferred/disliked words, the graph shows two distinct clusters.
Plot the centroids of each cluster, join them with a line, and draw in a
perpendicular bisector creating a Shax side and an others side : 1265 of
the points are on the correct side of the bisector, and only 22 on the
wrong side (98% success). The Hand D sample is again central in the Shax
cluster.
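
The centroid-and-bisector step is simple geometry, and a small Python sketch may help. The function names are my own (hypothetical), and the points are assumed to be (x, y) pairs such as the preferred-word and disliked-word scores of each segment. A point is on an author's side of the perpendicular bisector exactly when it is nearer that author's centroid.

```python
def centroid(points):
    """Mean position of a non-empty list of (x, y) points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def bisector_side(point, c_a, c_b):
    """Positive if point is nearer c_a, negative if nearer c_b, zero on the bisector."""
    # Project (point - midpoint) onto the direction running from c_b to c_a.
    mid_x, mid_y = (c_a[0] + c_b[0]) / 2, (c_a[1] + c_b[1]) / 2
    dir_x, dir_y = c_a[0] - c_b[0], c_a[1] - c_b[1]
    return (point[0] - mid_x) * dir_x + (point[1] - mid_y) * dir_y

def accuracy(group_a, group_b):
    """Fraction of points lying on their own group's side of the bisector."""
    c_a, c_b = centroid(group_a), centroid(group_b)
    correct = sum(bisector_side(p, c_a, c_b) > 0 for p in group_a)
    correct += sum(bisector_side(p, c_a, c_b) < 0 for p in group_b)
    return correct / (len(group_a) + len(group_b))
```

With the figures quoted above, 1265 / (1265 + 22) comes to roughly 0.983, which is where the 98% comes from.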

It's not QED, but it is -not- subjective in the old manner, deals
explicitly with both biased sampling and sample sizes, and involves
independent implications that converge. For Hand D it adds two sorts of
evidence -- for Shax, and against everyone else who has surviving plays
from the period. Ditto the countess scenes in *Edward III* and the central
argument scene in *Arden of Faversham* -- both consistently falling in a
Shax cluster and -not- in an anyone else cluster.

For more (and it is readable, though some of the maths made my head hurt),
and with fascinating scattergraphs, see:
Hugh Craig and Arthur Kinney, eds, *Shakespeare, Computers, and the Mystery
of Authorship* (Cambridge UP, 2009)

(Similar function-word tests have, I believe, been used in courts to show
e.g. that an alleged confession by an individual demonstrably has multiple

