Monday, March 26, 2012

NTA0: SampleTextForAnalysis

Numerical Text Analysis 0
Guide to all Text Analysis posts

Try to guess the text I am going to use as an illustrative example. At a certain stage of text processing, it has somewhat fewer than 2,000 distinct word-tokens and a total length of a bit less than 27,000 words. I rank these 2,000 word-tokens by frequency of appearance in the text. Here are some of the salient tokens and their ranks, in increasing order of rank. Scan the tokens one at a time, and stop when you guess the text they are from. Give yourself points equal to the rank of the word you stopped at, and report back to me via a comment.
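For readers who want to try the ranking step on a text of their own, here is a minimal sketch in Python. It is not the processing pipeline used for this post: the tokenizer (lowercase, runs of letters only) and the alphabetical tie-breaking are my assumptions, and the function name `rank_tokens` is hypothetical.

```python
import re
from collections import Counter

def rank_tokens(text):
    """Return (rank, token, count) tuples, most frequent token first."""
    # Assumed tokenizer: lowercase the text and keep runs of letters.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    # Sort by descending count; break ties alphabetically (an assumption).
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, token, count)
            for rank, (token, count) in enumerate(ordered, start=1)]

sample = "the queen said off with his head and the queen left"
ranking = rank_tokens(sample)
total_words = sum(count for _, _, count in ranking)
for rank, token, count in ranking[:3]:
    print(rank, token, count)
```

Here `len(ranking)` is the number of distinct word-tokens and `total_words` is the total length of the text in words, the two quantities quoted above.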


Here is more information about the words:


The word 'head' appears only 4/5 as often as the word 'queen'. Assuming that half of the appearances of 'head' occur in neutral circumstances, the Queen is yelling “Off with his head!” only 2/5 of the times she makes an appearance.
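The arithmetic behind that estimate can be checked in a couple of lines. This is only a sketch of the reasoning in the paragraph above; the variable names are mine, and the one-half neutral share is the post's assumption, not a measured value.

```python
from fractions import Fraction

# From the post: 'head' appears 4/5 as often as 'queen'.
head_per_queen = Fraction(4, 5)
# Assumption from the post: half of the 'head' appearances are neutral.
neutral_share = Fraction(1, 2)

# Remaining 'head' appearances, per appearance of 'queen'.
shouted_per_queen = head_per_queen * (1 - neutral_share)
print(shouted_per_queen)  # 2/5
```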

The full text can be found here.

1 comment:

  1. i guessed it was probably alice in wonderland at the third word, "caterpillar," and was certain at the fifth, "duchess." i'd normally call it lucky, but, really, who could be familiar with the work and see a list with "tea" and "caterpillar" in the first three words by frequency and not think of alice in wonderland?
