Guide to all Numerical Text Analysis Posts
For the corpus, i.e. the set of documents I am going to analyse, I've chosen the top 100 books over the last 30 days (as of 3/27/2012) from Project Gutenberg. (Incidentally, Vatsyayana's Kamasutra is a perennial number 3, which accords with Amartya Sen's footnote (pg. 25, The Argumentative Indian): "In fairness to Western expertise on India, it must be conceded that there has never been any lack of Occidental interest in what may be called the 'carnal sciences', led by the irrepressible Kamasutra and Anangaranga.") I looked at the source code for the page, copied it, and ran regular-expression searches for the serial numbers of the books, then wrote Python code to pull them from the website, keeping my fingers crossed that they wouldn't think I = Robot and block my IP address. Some of the .txt files turned out to be empty. For each document, I chopped off the beginning (Project Gutenberg info.) and the last bit (license info.).
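The scraping-and-cleanup step can be sketched roughly as below. The URL pattern, the sample book IDs, and the exact boilerplate markers are assumptions for illustration, not the exact ones I used:

```python
# Sketch: download plain-text books by serial number and strip the
# Project Gutenberg header/license boilerplate.
import re
import time
import urllib.request

# Assumed mirror layout; the real URLs may differ.
BASE = "https://www.gutenberg.org/cache/epub/{num}/pg{num}.txt"
BOOK_IDS = [1342, 11, 2701]  # placeholder serial numbers scraped from the top-100 page

def strip_gutenberg_boilerplate(text):
    """Keep only the body between the '*** START OF ...' and '*** END OF ...' markers."""
    m = re.search(r"\*\*\* ?START OF.*?\*\*\*(.*)\*\*\* ?END OF", text, re.DOTALL)
    return m.group(1) if m else text  # fall back to the whole text if markers are missing

def fetch_book(num):
    """Download one plain-text book and chop off the header and license."""
    with urllib.request.urlopen(BASE.format(num=num)) as f:
        raw = f.read().decode("utf-8", errors="replace")
    return strip_gutenberg_boilerplate(raw)

# Usage (hits the network, so not run here):
#   for num in BOOK_IDS:
#       with open(f"{num}.txt", "w", encoding="utf-8") as out:
#           out.write(fetch_book(num))
#       time.sleep(2)  # pause between requests so they don't think I = Robot
```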
As a reminder, the steps for processing text are:
- Lowercase, depunctuate and stem the words in the text of each document.
- Construct the histogram for the corpus: for each token, sum its occurrence frequencies over all the documents in the corpus.
- Identify the rare tokens – those that occur only once in the entire corpus. Why remove any low-frequency tokens at all? There are a large number of non-standard-English letters (possibly some of the documents are in languages other than English, or there may be a few words from other languages, or simple mitsakes) and non-ASCII and non-UTF-8 punctuation marks (for example, the Sanskrit diacritical notation in the book ranked number 3 above) that my algorithm doesn't pick up.
I thought that restricting the allowed atoms to lowercase unaccented Latin letters would rule out too many accented letters, so I played with removing low-frequency words instead. It turns out that even for 100 documents, ruling out the unipresent tokens is enough, since the remaining bipresent tokens seem quite reasonable.
- Identify the small tokens, that is, the single-character ones other than "a" and "i", and put both the small and the rare tokens in a list of raw words to be removed. The few small tokens can be removed from all the documents in the corpus outright. The rare tokens, however, should be left in place, since the rarity of a token can change as documents are added to the corpus. (Note that the number of rare tokens grows almost linearly with the number of words in the corpus, whereas the number of higher-frequency words grows much more slowly.)
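The steps above can be sketched as follows. The toy documents are made up, and the crude suffix stripper merely stands in for whichever real stemmer (e.g. Porter) you'd actually use:

```python
# Sketch of the pipeline: lowercase + depunctuate + stem each document,
# build the corpus histogram, then flag rare and small tokens.
import re
from collections import Counter

def stem(word):
    # Toy stemmer: strip a few common suffixes (a real stemmer does far more).
    for suf in ("ing", "ed", "es", "s"):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def tokenize(text):
    """Lowercase, depunctuate (keep runs of letters only), and stem one document."""
    return [stem(w) for w in re.findall(r"[a-z]+", text.lower())]

docs = ["The cats were running.", "A cat runs; cats run!"]  # toy corpus
histograms = [Counter(tokenize(d)) for d in docs]

# Corpus histogram: sum each token's frequencies over all documents.
corpus = Counter()
for h in histograms:
    corpus.update(h)

# Rare tokens occur exactly once in the whole corpus; small tokens are
# single characters other than "a" and "i".
rare = {t for t, n in corpus.items() if n == 1}
small = {t for t in corpus if len(t) == 1 and t not in ("a", "i")}
to_remove = rare | small
```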
From these counts you can see:
- The number of unipresent tokens is about half of all tokens, but they represent only a marginal fraction of the total word count.
- There are a surprisingly large number (132) of unidentified atoms, which on average occur multiple times.
- If you think that English is a gender-neutral language, think again.
- The rank–frequency distribution should be used to determine the minimum rank at which to exclude "stop" words (via the deviations from flatness at the high-frequency end). Let's take a look at the high-frequency end in more detail:
from which I can see choosing a stop rank anywhere between 50 and 150, but I don't see at all why the stop rank should be between 450 and 500 words, which is what standard stop-word lists consist of. Ignorance => opportunity to learn. For now, however, I am going to go with the top-ranked 103 tokens as the stopword list, which reverentially spares "god" and "himself", coming in at #124 and #125.
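The rank-based cutoff is straightforward to code: sort tokens by corpus frequency and take the top k as the stop list. The toy histogram and the cutoff of 3 below are stand-ins (the real corpus uses 103):

```python
# Sketch: choose stopwords as the k highest-frequency tokens in the corpus.
from collections import Counter

# Toy corpus histogram (an assumption; plug in the real one).
corpus = Counter({"the": 900, "of": 700, "and": 650, "god": 40, "himself": 39, "whale": 30})
STOP_RANK = 3  # 103 for the real corpus; 3 here to fit the toy data

ranked = [t for t, _ in corpus.most_common()]  # tokens in descending frequency order
stopwords = set(ranked[:STOP_RANK])
```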
- Add the set of stop words to the set of words to be removed. The complement of this subset in the set of all tokens in the corpus is the set of tokens to be used for the statistical calculations. The truncations of the histograms to this “use-set” of words are what we will use for studying Cosine Similarity, its alternatives, and what should really be done with the histograms.
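The use-set construction is a couple of set operations plus a truncation of each document histogram; all the inputs below are toy stand-ins:

```python
# Sketch: remove stop, rare, and small tokens from the set of all corpus
# tokens, then truncate each document histogram to the remaining "use-set".
from collections import Counter

# Toy per-document histograms and removal sets (assumptions for illustration).
histograms = [Counter({"the": 5, "whale": 3, "sea": 1, "x": 1}),
              Counter({"the": 4, "whale": 1, "god": 2})]
stopwords = {"the"}
rare = {"sea"}   # unipresent in the toy corpus
small = {"x"}
to_remove = stopwords | rare | small

all_tokens = set().union(*histograms)
use_set = all_tokens - to_remove  # complement of the removal set

# Truncate each histogram to the use-set; these feed the similarity calculations.
truncated = [Counter({t: h[t] for t in h if t in use_set}) for h in histograms]
```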
So in the next post, I'll lay out my arguments comparing the ubiquitous Cosine Similarity to some simple trigonometric alternatives, with some simple illustrative examples. The post after that will consider document similarity within the above corpus using both Cosine and my proposed alternative.
Next, I'll propose going beyond this approach to similarity, to situate ourselves properly in word-vector space and see what really good measures we can construct there, and of course, following all that theory, apply those ideas to the documents in the corpus. Yum-yum!