Identifying authors algorithmically
July 2009
We implement a "Bayesian classifier" to distinguish between different types of documents. It is a "classifier" because it tries to decide which class a document belongs to, and "Bayesian" because it makes use of some maths first developed by Thomas Bayes.
The plan is something like this:
(i) Take a set of training documents and decide which class each of them belongs to.
(ii) Present these documents to the classifier so it can learn useful things about them using some clever maths.
(iii) Then show the classifier some documents it has never seen before and let it make a pretty good guess at which class they belong to.
A classic use of this is as a spam filter, where the two classes are "spam" and "not-spam". The person receiving the email tells the computer whether items are spam or not and it learns how to distinguish them.
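The plan above can be sketched in a few lines of Python. This is a minimal multinomial naive Bayes with add-one smoothing, not necessarily the code behind the results in this article: the regex tokenizer, the class name `NaiveBayes` and its method names are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into words, keeping apostrophes (an assumed tokenizer)."""
    return re.findall(r"[A-Za-z']+", text)


class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # class -> word -> count
        self.doc_counts = Counter()              # class -> number of documents
        self.vocab = set()

    def train(self, label, text):
        """Step (ii): learn word counts from a document of known class."""
        words = tokenize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def log_scores(self, text):
        """Unnormalised log-probability of each class for this text."""
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)  # prior
            for word in tokenize(text):
                # Add-one smoothing so unseen words don't give log(0).
                p = (counts[word] + 1) / (total_words + len(self.vocab))
                score += math.log(p)
            scores[label] = score
        return scores

    def probabilities(self, text):
        """Step (iii): normalise the log scores into class probabilities."""
        scores = self.log_scores(text)
        top = max(scores.values())
        exp = {label: math.exp(s - top) for label, s in scores.items()}
        norm = sum(exp.values())
        return {label: e / norm for label, e in exp.items()}


# Toy training data, invented for this example.
nb = NaiveBayes()
nb.train("spam", "win cash now, claim your free prize")
nb.train("not-spam", "lunch on Tuesday? I will bring the report")
probs = nb.probabilities("claim your free cash")
```

Working with sums of logarithms rather than products of probabilities avoids numerical underflow on long documents, which is why the scores are only turned back into probabilities at the very end.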
We're going to test it on a different problem: given extracts from English fiction, can it distinguish between different books?
We won't describe the theory as it's been covered elsewhere. If you're interested, see e.g. the Wikipedia article "Naive Bayes classifier".
Does it work?
The two texts we use are "Pride and Prejudice" (PP) by Jane Austen and "The Chronicles of Clovis" (CC) by Saki.
We first split each text into paragraphs and then reject paragraphs of fewer than 20 words. Then we randomly pick 20% of the remaining paragraphs to be our "training data" and the other 80% to be "test data". The classifier gets to learn from the "training data" and we see how well it does in classifying the "test data" that it hasn't seen before.
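A rough sketch of this preparation step follows. It assumes that paragraphs are separated by blank lines in the plain-text file; the function names `paragraphs` and `split_train_test` and the `seed` parameter are hypothetical.

```python
import random
import re


def paragraphs(text, min_words=20):
    """Split on blank lines and drop paragraphs shorter than min_words."""
    chunks = re.split(r"\n\s*\n", text)
    cleaned = (" ".join(chunk.split()) for chunk in chunks)
    return [p for p in cleaned if len(p.split()) >= min_words]


def split_train_test(items, train_fraction=0.2, seed=None):
    """Randomly assign a fraction of items to training, the rest to testing."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy stand-in for a book: ten 25-word paragraphs and one short one.
book = "\n\n".join([" ".join(["word"] * 25)] * 10 + ["too short"])
paras = paragraphs(book)
train, test = split_train_test(paras, seed=1)
```

Passing a fixed `seed` makes a run repeatable. Leaving it as `None` gives a different 20% each time, which is why scores vary slightly between runs.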
The classifier tells us the probability that the paragraph is of a particular class. We'll call it successful if the probability of the correct class is over 50%. So it can only predict PP or CC: no "don't knows".
The scores will vary slightly depending on which 20% of the paragraphs we randomly choose as training texts. For a typical run we have:
pp : 1124 test cases : 36 errors : 96.8% success
cc : 427 test cases : 17 errors : 96.0% success
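The scoring rule is simple enough to write down. The sketch below counts a paragraph as an error when the probability given to its correct class is not over 50%, then formats a line like the ones above. The function name `report` is hypothetical, and the probabilities fed in are made up purely to reproduce the counts from this run.

```python
def report(label, correct_class_probs):
    """Format a result line from the probability given to the correct class."""
    cases = len(correct_class_probs)
    errors = sum(1 for p in correct_class_probs if p <= 0.5)
    success = 100 * (cases - errors) / cases
    return f"{label} : {cases} test cases : {errors} errors : {success:.1f}% success"


# Made-up probabilities chosen only to match the error counts quoted above.
pp_line = report("pp", [0.9] * (1124 - 36) + [0.1] * 36)
cc_line = report("cc", [0.9] * (427 - 17) + [0.1] * 17)
```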
What affects it?
We can get a feel for what influences good predictions.
The strongest correct prediction for PP is for a lengthy 141-word paragraph beginning:
With proper civilities the ladies then withdrew; all of them equally surprised that he meditated a quick return. Mrs. Bennet wished to understand by it that he thought of paying his addresses to one of her younger girls, and Mary might have been prevailed on to accept him. ...
We can calculate the five words which most strongly influenced the prediction in this case. They are:
Elizabeth, Bennet, ladies, soon, girls
which are all typical of PP.
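One plausible way to measure a word's influence, assumed here rather than taken from the original code, is its log-likelihood ratio: the log of how much more likely the word is under one book than the other, summed over each time it appears in the paragraph. The word counts below are invented toy numbers, and the function names are hypothetical.

```python
import math
from collections import Counter


def word_evidence(word, counts_a, counts_b, vocab_size):
    """Log-likelihood ratio of a word: positive favours A, negative favours B."""
    p_a = (counts_a[word] + 1) / (sum(counts_a.values()) + vocab_size)
    p_b = (counts_b[word] + 1) / (sum(counts_b.values()) + vocab_size)
    return math.log(p_a / p_b)


def top_influences(paragraph_words, counts_a, counts_b, n=5):
    """Words in the paragraph contributing most evidence for class A."""
    vocab_size = len(set(counts_a) | set(counts_b))
    totals = Counter()
    for word in paragraph_words:
        # A word that appears twice contributes its evidence twice.
        totals[word] += word_evidence(word, counts_a, counts_b, vocab_size)
    return [w for w, _ in totals.most_common(n)]


# Invented counts standing in for what the classifier learned from each book.
counts_pp = Counter({"Elizabeth": 30, "ladies": 10, "the": 100, "soon": 8})
counts_cc = Counter({"Clovis": 25, "the": 100, "farm": 6, "soon": 2})
words = ["the", "ladies", "soon", "Elizabeth", "the"]
top_two = top_influences(words, counts_pp, counts_cc, n=2)
```

A common word like "the" is about equally likely in both books, so its ratio is close to zero and it barely moves the prediction however often it appears.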
Similarly we compare word frequencies to see which words are most helpful in distinguishing the books. The 20 words that give the strongest evidence for CC are:
Clovis I'm farm Baroness I've Bertie beast Tobermory Vespaluus Belturbet Groby Brimley Huddle Mortimer Stoner won't Conradin It's Packletide Susan
And the same for PP:
Elizabeth Darcy Bennet Bingley Wickham Collins Lydia Jane Catherine Longbourn feelings happiness ladies Netherfield Lizzy Gardiner Charlotte uncle Meryton Kitty
For both books the majority of the most important words are names. If we restrict the analysis to words without capital letters then the results are:
pp : 1071 test cases : 40 errors : 96.3% success
cc : 393 test cases : 41 errors : 89.6% success
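The restriction to words without capital letters might look something like this. It is a sketch that assumes a case-sensitive test on each token; `lowercase_words` is a hypothetical name and the sample sentence is invented.

```python
import re


def lowercase_words(text):
    """Return the words of text that contain no capital letters."""
    words = re.findall(r"[A-Za-z']+", text)
    return [w for w in words if w == w.lower()]


sample = "It's obvious that Clovis won't feed the hyaena, said Bertie."
kept = lowercase_words(sample)
```

A filter written this way also discards ordinary words that happen to start a sentence, such as "It's" in the sample, so it removes slightly more than just the names.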
And the 20 most influential words are now:
CC: farm beast won't goes it's tree van food he's hyaena meal obvious parrot tiger wouldn't don't big bird dramatic entire
PP: feelings happiness ladies uncle manners civility assure girl sisters cried expect resolved answered consequence affection society delight appear attachment mother's
PP leans towards emotions and family; CC towards animals and apostrophes. This has the makings of a parlour game: given the twenty words, can you guess the book? One for another day perhaps.
Other authors
As a final test we can consider whether the classifier is learning the style of the author or the subject matter of the book.
We train on the whole of CC and PP, and then test on another Austen novel "Sense and Sensibility" (SS).
pp : 1168 test cases : 48 errors : 95.9% success
cc : 1168 test cases : 1120 errors : 4.1% success
Of the 1168 paragraphs (of 20 words or more) in SS we find that 96% are classified as PP and just 4% as CC. This seems to show that the classifier is not just learning subject matter but also writing style.
We can even try using a book by a different author altogether: PG Wodehouse's "Mike and Psmith" (MPS).
pp : 618 test cases : 553 errors : 10.5% success
cc : 618 test cases : 65 errors : 89.5% success
Of the 618 paragraphs we find that 90% are classified as CC and 10% as PP. So the classifier thinks that the early 20th-century comic writing of PG Wodehouse is more like the early 20th-century comic writing of Saki than the early 19th-century writing of Austen.
Final thoughts...
Despite having read about Bayesian classifiers before, I was surprised by just how well the basic algorithm performed. While being able to say that Wodehouse is like Saki isn't exactly the pinnacle of literary criticism, the fact that a computer can do so in a few lines of code is remarkable.
The results here are good, using just a basic approach. Getting to 99% or 99.9...% accuracy would require more work. Paul Graham has written about applying this technique as a spam filter and some of the modifications he made to get good results.
ibbly.com